ABSTRACT


THE  COMPLEXITY  OF  LEARNING  FROM  A  MIXTURE 
OF  LABELED  AND  UNLABELED  EXAMPLES 

Joel  E.  Ratsaby 
Santosh  S.  Venkatesh 

The  learning  of  a  pattern  classification  rule  rests  on  acquiring  information  to 
constitute  a  decision  rule  that  is  close  to  the  optimal  Bayes  rule.  Among  the  various 
ways  of  conveying  information,  showing  the  learner  examples  from  the  different  classes 
is  an  obvious  approach  and  ubiquitous  in  the  pattern  recognition  field.  Basically  there 
are  two  types  of  examples:  labeled  in  which  the  learner  is  provided  with  the  correct 
classification  of  the  example  and  unlabeled  in  which  this  classification  is  missing. 
Driven  by  the  reality  that  often  unlabeled  examples  are  plentiful  whereas  labeled 
examples  are  difficult  or  expensive  to  acquire  we  explore  the  tradeoff  between  labeled 
and  unlabeled  sample  complexities  (the  number  of  examples  required  to  learn  to 
within  a  specified  error),  specifically  getting  a  quantitative  measure  of  the  reduction 
in  the  labeled  sample  complexity  as  a  result  of  introducing  unlabeled  examples.  This 
problem  was  posed  in  this  form  by  T.  M.  Cover  and  may  be  succinctly,  if  inexactly, 
stated  as  follows:  How  many  unlabeled  examples  is  one  labeled  example  worth? 

The direction taken in this dissertation focuses on the archetypal problem of
learning a classification problem with two pattern classes that are typified by feature
vectors, i.e., examples drawn from class conditional Gaussian distributions, and
where the learning approaches are parametric and nonparametric. Denoting the
dimensionality of the example-space as N, and the number of labeled and unlabeled
examples as m and n respectively, it is shown for specific algorithms that under
a nonparametric scenario the classification error probability decreases roughly


as  O  (Jcq n~2^N^N^  +0  (e~Cim),  and  in  the  parametric  scenario  the  error  decreases 
roughly as $O(N^{3/5} n^{-1/5}) + O(e^{-c_1 m})$, where $c_0, c_1 > 0$ are constants with respect to
N,  m,  and  n.  This  shows  that  in  both  the  parametric  and  nonparametric  cases  it 
takes  roughly  exponentially  more  unlabeled  examples  than  labeled  examples  for  the 
same  reduction  in  error.  When  considering  the  effect  of  the  dimensionality  N ,  roughly 
speaking,  a  labeled  example  is  worth  exponentially  more  in  the  nonparametric  than 
in  the  parametric  scenario. 

The  parametric  approach  uses  the  Maximum  Likelihood  technique  with  labeled 
and  unlabeled  samples  to  construct  a  decision  rule  estimate.  In  this  scenario  the 
learner  knows  the  parametric  form  of  the  pattern  class  densities.  Sufficient  finite 
sample  complexities  are  established  by  which  the  value  of  one  labeled  example  in 
terms  of  the  number  of  unlabeled  examples  is  determined  to  be  polynomial  in  the 
dimensionality N. The analysis may provide the details for broadening the results to
other non-Gaussian parametric families of problems. An extension to the case
of  different  a  priori  class  probabilities  is  investigated  under  this  parametric  scenario, 
and  for  the  non-unit  covariance  Gaussian  problem  it  is  conjectured  that  the  value  of 
a  labeled  example  is  still  polynomial  in  N. 

In the nonparametric scenario, the primary focus is on an algorithm which is
based on Kernel Density Estimation. It uses a mixed sample to construct a decision
rule where now the learner has significantly less side information about the class
densities. The finite sample complexities for learning the Gaussian based problem are
established by which the value of one labeled example is determined to be exponential
in the dimensionality N. An extension to a larger family of nonparametric classification
problems is provided where the same tradeoff applies. A variant of this approach
is investigated in which only a finite number of functional values of the underlying
mixture density are estimated. This yields a smaller tradeoff but is still exponential
in N. The mixed sample complexities for the classical k-means clustering procedure
are also determined.

An  experimental  investigation  using  neural  networks  examines  the  value  of  a 
labeled  example  when  learning  a  classification  problem  based  on  a  Gaussian  mixture. 
For  other  classification  problems,  the  cost  of  learning  measured  by  the  labeled  sample 
size  as  a  function  of  the  dimensionality  N,  is  shown  to  be  lower  for  a  two-layer  network 
than  with  the  regular  single  layer  Kohonen  network.  This  is  attributed  to  the  better 
discrimination ability of the partition of the classifier.

Chapter 1
Introduction 


The problem of learning a classification decision rule (cf. Duda & Hart [1], Fukunaga
[2]) has been the subject of a large, diverse body of literature spanning at least the
last 60 years. It has been approached by many methods in statistics and pattern
recognition. In its basic form, we are given two classes, “1” and “2”, of patterns that
are represented by vectors whose elements represent various features of each of the
two patterns. For instance, in the medical diagnosis of cancer, the two pattern classes
are “malignant” and “benign” cells. A cell in a class has a variety of features such as
size, color, shape, genetic code, etc., by which it is described. For our purposes, we
assume the features are represented by a point $x$ in N-dimensional Euclidean space.
The objective is to find a decision rule which, when presented with a pattern (i.e., a
vector) that is drawn randomly either from class “1” (with probability $p_1$) or class
“2” (with probability $p_2$), produces a label which identifies it as belonging to the true
class of origin. Ideally we would desire a rule that never misclassifies a pattern. This
however is only achievable if the pattern classes have non-overlapping probability-one
supports; in general the best achievable rule (the Bayes classifier) has a nonzero
misclassification error, $P_{\mathrm{Bayes}}$, determined by the class conditional probability density
functions $f_1(x)$, $f_2(x)$ and the a priori class probabilities $p_1$ and $p_2$.

The Bayes decision rule is derived from the following: let $l(i,j)$, with $i, j \in \{1, 2\}$,
denote the loss incurred when the classifier decides “i” while the true class of the
pattern is “j”. We limit our discussion to the case of the symmetric 0-1 loss function:
$l(1,1) = l(2,2) = 0$ and $l(1,2) = l(2,1) = 1$, for which the expected loss is identically
the probability of misclassification, $P_{\mathrm{error}}$. In this case we have

\[ P_{\mathrm{error}} = E\, l(i,j) = E_x E(l(i,j) \mid x). \]

The inner expectation is

\[
\begin{aligned}
E(l(i,j) \mid x) &= P(i = 1, j = 2 \mid x) + P(i = 2, j = 1 \mid x) \\
&= P(i = 1 \mid j = 2, x)\, p(j = 2 \mid x) + P(i = 2 \mid j = 1, x)\, p(j = 1 \mid x)
\end{aligned}
\]

where $p(j = 1 \mid x)$, $p(j = 2 \mid x)$ are the a posteriori class probabilities. This expectation
is a nonnegative quantity, hence in order to minimize $P_{\mathrm{error}}$ it suffices to specify a
classification rule which minimizes it for every $x$.

A classifier can be considered as a mapping

\[ C : \mathbb{R}^N \to \{1, 2\} \]

or a partition of the feature space into disjoint regions $R_1$, $R_2$, where

\[ C(x) = \begin{cases} 1 & \text{if } x \in R_1, \\ 2 & \text{if } x \in R_2. \end{cases} \]

This is a deterministic rule, hence we have

\[ P(i = 1 \mid j = 2, x) = 1_{R_1}(x) \]

and

\[ P(i = 2 \mid j = 1, x) = 1_{R_2}(x) \]

where we use the notation $1_A(x)$ to denote the indicator function for the set $A$, i.e.,

\[ 1_A(x) = 1 \text{ if } x \in A \quad \text{and} \quad 1_A(x) = 0 \text{ if } x \notin A. \]


The optimal (or Bayes) classifier is one which minimizes

\[ p(j = 1 \mid x)\, 1_{R_2}(x) + p(j = 2 \mid x)\, 1_{R_1}(x). \]

Only the decision regions $R_1$, $R_2$ are controllable and it is clear that the minimizing
choice is

\[ R_1 = \{x : p(j = 2 \mid x) < p(j = 1 \mid x)\} \quad \text{and} \quad R_2 = \{x : p(j = 1 \mid x) < p(j = 2 \mid x)\}. \]

The decision border is

\[ \{x : p(j = 1 \mid x) = p(j = 2 \mid x)\} = \{x : p_1 f_1(x) = p_2 f_2(x)\}, \tag{1.1} \]

where the last equality follows from Bayes' theorem, $f_i(x)\, p_i = p(j = i \mid x)\, f(x)$, with
$f_i(x)$ being the class conditional densities, $f(x) = p_1 f_1(x) + p_2 f_2(x)$ the mixture
density, and $p_i$ the a priori class probabilities, $i = 1, 2$.

Hence $f_1(x)$, $f_2(x)$ and $p_1$, $p_2$ determine the Bayes decision rule and the resulting
(minimum) error of the Bayes classifier is zero if and only if the pattern classes have
disjoint probability-one supports.
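
As a minimal sketch of the rule in (1.1), assume a hypothetical one-dimensional problem whose class conditional densities are unit-variance Gaussians with means -1 and +1 (these values and all function names are illustrative choices, not taken from the text). The code decides class “1” when $p_1 f_1(x) \ge p_2 f_2(x)$ and approximates the error of this rule, $\int \min(p_1 f_1(x), p_2 f_2(x))\,dx$, by a midpoint sum.

```python
import math


def gaussian_pdf(x, mean, sigma=1.0):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def bayes_classify(x, p1, f1, f2):
    """Return 1 if p1*f1(x) >= p2*f2(x), else 2 (ties are given to class 1)."""
    p2 = 1.0 - p1
    return 1 if p1 * f1(x) >= p2 * f2(x) else 2


def bayes_error(p1, f1, f2, lo=-12.0, hi=12.0, steps=24000):
    """Approximate the integral of min(p1*f1, p2*f2) with the midpoint rule."""
    p2 = 1.0 - p1
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += min(p1 * f1(x), p2 * f2(x))
    return total * h


# Hypothetical example: class means -1 and +1, unit variance, equal priors.
f1 = lambda x: gaussian_pdf(x, -1.0)
f2 = lambda x: gaussian_pdf(x, +1.0)

print(bayes_classify(-0.3, 0.5, f1, f2))
print(bayes_classify(0.7, 0.5, f1, f2))
print(bayes_error(0.5, f1, f2))
```

With equal priors the decision border of this example is $x = 0$, and the overlap of the two densities keeps the error strictly positive, in line with the remark on disjoint supports.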

If the class conditional densities and the priors were known, we could determine
the Bayes optimal decision rule with $P_{\mathrm{error}} = P_{\mathrm{Bayes}}$. However, realistically,
this is a rare occurrence; as in the above medical diagnosis example, such detailed
prior information is usually not available. We can at best hope for partial information
about the classes, a typical scenario providing randomly drawn data according to
the unknown probability distributions. This will be our focus here. Using a random
sample, our goal is to determine a rule that achieves a given error probability which
is not much larger than $P_{\mathrm{Bayes}}$. More precisely, for $\epsilon > 0$ chosen suitably small, we
would like to obtain $P_{\mathrm{error}}$ bounded between $P_{\mathrm{Bayes}}$ and $P_{\mathrm{Bayes}}(1 + \epsilon)$.

Broadly speaking, the approach to classifier design is to use randomly drawn
examples to estimate the class conditional densities and plug them into the above
expression that relates the densities with the decision regions. The resulting rule
has a classification error $P_{\mathrm{error}}$ which may differ from the optimal $P_{\mathrm{Bayes}}$. This
approach can be validated, at least asymptotically in the limit of large sample sizes. For
example, it is shown in Glick [38] that sample-based density plug-in rules are
asymptotically optimal, i.e., minimize the classification error, when the density estimates are
themselves densities and are strongly consistent.

Classification  methods  vary  according  to  the  type  and  amount  of  additional  side- 
information  that  is  available.  Direct  information  about  the  class  densities  leads  to 
an  estimate  of  the  likelihood  ratio  and  hence  of  the  optimal  decision  border.  More 
typically,  only  partial  information  is  accessible;  for  instance:  the  parametric  form  of 
the  distributions  but  not  the  parameter  value;  knowledge  that  the  distributions  are 
monotone  decreasing;  or  that  the  mixture  (i.e.,  weighted  sum  of  the  class  conditionals) 
has  k  modes  (peaks). 

Traditionally,  in  the  fields  of  statistics  and  pattern  recognition,  there  are  two 
main  categories  for  density  estimation:  parametric  and  non-parametric.  These  are 
divided  into  various  branches  based  upon  the  estimation  method  which  depends  on 
the  information  that  is  provided  (or  assumed)  about  the  classes;  for  instance,  if  it 
is  known  that  the  class  densities  are  of  a  given  parametric  form  then  the  method  of 
maximum-likelihood  can  be  invoked.  Once  a  density-estimation  method  is  chosen,  it 
remains  to  learn  the  constraints  in  the  observed  data  and  deduce  the  density  that  is 
closest  (w.r.t.  some  quantitative  measure)  to  the  true  underlying  class  densities. 

If information regarding the densities is not available then one must resort to
assumptions or heuristics based on some rules of thumb, in order to construct a
decision border that hopefully has low $P_{\mathrm{error}}$. For instance, observed data can be
tested for clusters and a partition of the feature space is constructed such that each
cluster is captured by one disjoint subset (a cell) of the partition. Then each cell gets
associated with the class of the majority of the observed examples that fall in it.
This induces a decision rule which may have low error. Neural network
algorithms, as applied to learning classification, are among the numerous ad hoc methods
that fall under this category.

1.1  Classification  Methodologies 

The following is a nonexhaustive list of a few popular classification methodologies (cf.
Fukunaga [2], Duda & Hart [1], Izenman [3]):

Parametric density estimation: We are restricted to a class of parametric density
functions, $f(x \mid \theta)$, with known form and unknown true parameter $\theta_0$.

Maximum Likelihood Estimation (MLE): The parameter is viewed as a deterministic
variable, and one solves for the value of $\theta$ that achieves the global
maximum of the likelihood function $L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log f(x_i \mid \theta)$, where $\{x_i,
1 \le i \le n\}$ is a set of examples drawn independently and distributed according
to $f(x \mid \theta_0)$. This value of $\theta$ is defined as the estimator $\hat{\theta}$. Defining
the decision regions as in (1.1) using the estimates $f(x \mid \hat{\theta}_1)$ and $f(x \mid \hat{\theta}_2)$ for
the two class densities yields an estimate of the Bayes classifier.

Maximum likelihood parameter estimates typically exhibit optimal properties
(cf. Bickel & Doksum [39]). They are often consistent,
i.e., converge to the true unknown parameter as the sample size increases,
and asymptotically efficient, i.e., their variance asymptotically
attains the Cramér-Rao lower bound. Hence a decision rule based on
the MLE density estimates is, at least theoretically, attractive for solving
a classification problem.
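
The maximum likelihood plug-in rule can be sketched for a deliberately simple case. The assumptions here are illustrative, not the thesis's setting: one-dimensional classes, unit-variance Gaussian densities with unknown means, equal priors, and an average log-likelihood (the normalization does not change the maximizer). For this family the maximizer is the sample average, and the estimated border solves $p_1 f(x \mid \hat{\theta}_1) = p_2 f(x \mid \hat{\theta}_2)$ at the midpoint of the two estimated means.

```python
import math
import random


def avg_log_likelihood(theta, sample):
    """Average log-likelihood of a unit-variance 1-D Gaussian with mean theta."""
    const = -0.5 * math.log(2.0 * math.pi)
    return sum(const - 0.5 * (x - theta) ** 2 for x in sample) / len(sample)


def mle_mean(sample):
    """For this family the likelihood is maximized by the sample average."""
    return sum(sample) / len(sample)


def plug_in_border(sample1, sample2):
    """Estimated decision border for equal priors and unit variances:
    the midpoint of the two estimated class means."""
    return 0.5 * (mle_mean(sample1) + mle_mean(sample2))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
class1 = [rng.gauss(-1.0, 1.0) for _ in range(2000)]  # assumed true mean -1
class2 = [rng.gauss(+1.0, 1.0) for _ in range(2000)]  # assumed true mean +1

theta1_hat = mle_mean(class1)
border = plug_in_border(class1, class2)
print(theta1_hat, border)
```

Both samples here are labeled; the sketch only illustrates the plug-in step, not learning from a mixed sample.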
Bayesian Estimation: We seek the distribution of $x$ given the random sample
$x^n = (x_1, \ldots, x_n)$, i.e., $f(x \mid x^n)$. The parameter $\theta$ is viewed as a random
variable with an a priori distribution $h(\theta)$. Any side information we might
have about the unknown parameter is assumed to be contained in this
distribution. We can write

\[ f(x \mid x^n) = \int f(x \mid x^n, \theta)\, g(\theta \mid x^n)\, d\theta = \int f(x \mid \theta)\, g(\theta \mid x^n)\, d\theta \]

because, given $\theta$, $x$ is independent of the sample $x^n$. By assumption, $f(x \mid \theta)$ is
known, hence the desired density is the expected value of $f(x \mid \theta)$ w.r.t. the
a posteriori distribution $g(\theta \mid x^n)$ of $\theta$ given the random sample $x^n$.

Learning involves updating the a posteriori distribution $g(\theta \mid x^n)$, whose
variance decreases as the number of examples increases, whereby the integral
on the right tends to $f(x \mid \theta_0)$. For instance, in learning the mean of
a Gaussian distributed random variable, the variance of the estimator $\hat{\theta}$ is
asymptotic to $\sigma^2 / n$ as $n \to \infty$.
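
The shrinking posterior variance can be illustrated with a closed-form update. This sketch assumes a Gaussian prior on the unknown mean and a known observation variance (a conjugate pair chosen here for convenience; the function and parameter names are hypothetical). The posterior variance is $1 / (1/\tau^2 + n/\sigma^2)$ for prior variance $\tau^2$, which behaves like $\sigma^2 / n$ for large $n$.

```python
def posterior_of_mean(sample, sigma2, prior_mean, prior_var):
    """Posterior mean and variance of a Gaussian mean theta, for observations
    with known variance sigma2 and a Gaussian prior on theta (an assumption
    made here so that the update has a closed form)."""
    n = len(sample)
    precision = 1.0 / prior_var + n / sigma2
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + sum(sample) / sigma2)
    return post_mean, post_var


# The ratio of the posterior variance to sigma2 / n approaches 1 as n grows.
for n in (1, 10, 100, 1000):
    _, var = posterior_of_mean([0.0] * n, sigma2=4.0, prior_mean=0.0, prior_var=1.0)
    print(n, var, var * n / 4.0)
```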

Moment Estimation: The parameter vector $\theta$ is composed of the moments $m_i$,
$1 \le i \le k$, of the true distribution. These are estimated by the empirical
(sample) moments $\hat{m}_i = \frac{1}{n} \sum_{j=1}^{n} x_j^i$. These estimates are consistent and
yield consistent density estimates.
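
A short sketch of the empirical moments for scalar examples follows. The helper that converts the first two moments into a mean and a variance is an added illustration for a one-dimensional Gaussian family, not a construction taken from the text.

```python
def empirical_moments(sample, k):
    """Return the first k empirical moments (1/n) * sum_j x_j**i, i = 1..k."""
    n = len(sample)
    return [sum(x ** i for x in sample) / n for i in range(1, k + 1)]


def gaussian_from_moments(sample):
    """Mean and variance implied by the first two moments (1-D case)."""
    m1, m2 = empirical_moments(sample, 2)
    return m1, m2 - m1 * m1


print(empirical_moments([1.0, 2.0, 3.0], 2))
print(gaussian_from_moments([1.0, 2.0, 3.0]))
```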

Non-parametric density estimation: Very little information is available. Neither the
form of the class conditional distributions nor any of the parameters, if such
exist, are known.

Kernel Estimation: A function $K_{\sigma_n, x_i}(x)$, called the kernel, is placed centered
at each example $x_i$, i.e.,

\[ K_{\sigma_n, x_i}(x) = K\!\left( \frac{x - x_i}{\sigma_n} \right) \]

where $\sigma_n$ is a smoothing parameter. The smoothed average of the functions,
each centered at one of the $n$ examples, forms the density estimate

\[ f_n(x) = \frac{1}{n \sigma_n^N} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{\sigma_n} \right), \]

where $N$ denotes the dimension. The bias of the estimate decreases as
the smoothing parameter $\sigma_n \to 0$. However, the rate of decrease of the
variance of the estimate as $n \to \infty$ becomes worse, i.e., slower, as $\sigma_n \to 0$.

By selecting $\sigma_n$ to decrease to zero at the right rate, it is possible for $f_n(x)$
to be strongly consistent, uniformly for all $x \in \mathbb{R}^N$ (cf. Pollard [21]).

The shape of the kernel function can be designed to accelerate the decrease
in the bias of the estimate as $n \to \infty$ (cf. Izenman [3], Silverman [40]).
For learning classification, it may not be necessary for $f_n(x)$ to be a bona
fide pdf, as for instance in our investigation in Chapter 5. There the modes
of the mixture density can directly determine the Bayes decision regions,
and may be estimated by the modes of a kernel estimate which also takes
negative values. With such a kernel, better rates of decrease for the bias
are achievable.
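
The kernel estimate can be sketched as follows, assuming a standard Gaussian kernel (one common choice, used here only for illustration; the sample values and names are hypothetical). The estimate places one scaled kernel at each example and divides the sum by $n \sigma^N$.

```python
import math


def gaussian_kernel(u):
    """Standard N-dimensional Gaussian kernel evaluated at the vector u."""
    n_dim = len(u)
    sq_norm = sum(c * c for c in u)
    return math.exp(-0.5 * sq_norm) / (2.0 * math.pi) ** (n_dim / 2.0)


def kernel_density_estimate(x, sample, sigma):
    """f_n(x) = 1/(n * sigma^N) * sum_i K((x - x_i) / sigma) for x in R^N."""
    n = len(sample)
    n_dim = len(x)
    total = 0.0
    for xi in sample:
        u = [(a - b) / sigma for a, b in zip(x, xi)]
        total += gaussian_kernel(u)
    return total / (n * sigma ** n_dim)


# Hypothetical 1-D sample with two groups of examples around -1 and +1.
sample = [[-1.2], [-0.8], [0.9], [1.1]]
for sigma in (0.1, 0.5, 2.0):
    print(sigma, kernel_density_estimate([0.0], sample, sigma))
```

A small smoothing parameter follows the individual examples closely, while a large one blurs the two groups into a single bump, which is the bias and variance tension described above.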


Histogram Methods: The histogram method is an old and basic approach to
density estimation. The feature space is divided into cells, $c_i \subset \mathbb{R}^N$, $1 \le i \le M$,
and the density function is approximated by the number of examples
that fall in each cell. In a one-dimensional sample space the estimate is

\[ f(x) = \frac{1}{n \sigma_n} \sum_{i=1}^{M} n_i\, 1_{c_i}(x), \]

where $n_i$ is the number of examples in $c_i$, and $\sigma_n$ is the cell width. The
histogram density estimate is suboptimal and its defects include the discontinuity
at cell boundaries and its strong sensitivity to the location of
the origin, i.e., shifting the starting point of the first cell can result in very
different looking histograms. For optimal error rates, the cell width needs
to decrease slower than $n^{-1}$ as $n \to \infty$. But even its optimal error rate is
substantially slower than that of most other kinds of density estimators.
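
A minimal one-dimensional sketch of the histogram estimate, with a hypothetical four-point sample, also shows the sensitivity to the origin: the same point receives a different density value once the cells are shifted.

```python
import math


def histogram_estimate(x, sample, origin, width):
    """1-D histogram density estimate n_i / (n * width), where n_i counts the
    examples in the cell [origin + i*width, origin + (i+1)*width) containing x."""
    cell = math.floor((x - origin) / width)
    count = sum(1 for xi in sample if math.floor((xi - origin) / width) == cell)
    return count / (len(sample) * width)


sample = [0.1, 0.2, 0.6, 1.5]
# Same point, same cell width, two different starting points for the cells.
print(histogram_estimate(0.5, sample, origin=0.0, width=1.0))
print(histogram_estimate(0.5, sample, origin=0.55, width=1.0))
```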


Direct  classification  approaches :  These  methods  do  not  estimate  the  class  conditional 
densities  but  instead  directly  construct  a  decision  rule  using  the  randomly  drawn 
examples. 


Nearest Neighbor Rule: A partition of the feature space is constructed by
drawing $m$ labeled examples, $x_i$, $1 \le i \le m$, and defining a Voronoi cell of
the partition to be

\[ V_i = \{x : |x - x_i| < |x - x_j|,\ j \ne i\}. \]

Each Voronoi cell is assigned the label of the example $x_i$ corresponding to
it. The decision rule for classifying a given $x$ is to assign to it the label of
the Voronoi cell in which it falls.

The nearest neighbor decision rule is suboptimal since even as the number
of labeled examples tends to infinity its classification error need not tend
to the Bayes optimal error. However, its large-sample error is bounded (tightly) as

\[ P_{\mathrm{Bayes}} \le P_{\mathrm{error,NN}} \le 2 P_{\mathrm{Bayes}} (1 - P_{\mathrm{Bayes}}). \]

The simplicity of this rule, its near-optimal performance for small $P_{\mathrm{Bayes}}$
($P_{\mathrm{error,NN}}$ is upper bounded by twice the Bayes error for large sample
sizes) and the fact that it is based on a Voronoi partition whose efficient
implementation has been studied, are significant advantages.
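
The nearest neighbor rule needs no explicit construction of the Voronoi cells: finding the closest labeled example is equivalent to locating the cell. The sketch below uses a brute-force search over a hypothetical three-example labeled sample, and a helper that evaluates the upper bound $2 P_{\mathrm{Bayes}}(1 - P_{\mathrm{Bayes}})$ quoted above.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def nearest_neighbor_label(x, labeled_sample):
    """Label of the labeled example closest to x, i.e., the label of the
    Voronoi cell that contains x."""
    _, label = min(labeled_sample, key=lambda pair: sq_dist(x, pair[0]))
    return label


def nn_upper_bound(p_bayes):
    """Large-sample upper bound 2 * P_Bayes * (1 - P_Bayes) on the NN error."""
    return 2.0 * p_bayes * (1.0 - p_bayes)


# Hypothetical labeled sample in the plane.
labeled = [((0.0, 0.0), 1), ((3.0, 3.0), 2), ((0.0, 4.0), 2)]
print(nearest_neighbor_label((1.0, 1.0), labeled))
print(nearest_neighbor_label((2.5, 2.0), labeled))
print(nn_upper_bound(0.1))
```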
Clustering Procedures: These methods aim to discover inherent clusters of patterns
in the given sample that possess strong feature-similarity. Each
cluster constitutes one of the mutually disjoint decision regions of the classifier.
The performance of the classifier depends on the metric that is used
to measure similarity between examples. Different problems, with different
class densities, may require different similarity measures for good classifier
performance.

One way of learning the partition is to choose any one which extremizes a
criterion function. For instance a simple criterion is

\[ \sum_{i=1}^{c} \sum_{x \in R_i} |x - \mu_i|^2 \]

where $R_i$, $1 \le i \le c$, are the clusters of a particular partition and $\mu_i$ is
the average of the examples in the $i$th cluster. Minimizing this function
over the space of possible partitions may yield an optimal classifier, in
particular for problems that have well separated pattern classes.
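
One common heuristic for decreasing this sum-of-squares criterion is Lloyd's iteration (the usual k-means loop), sketched below. It is named here as an illustration rather than as the procedure analyzed in the thesis, it only guarantees a local minimum in general, and the data and starting centers are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]


def cluster_means(points, labels, centers):
    """Average of the points in each cluster; an empty cluster keeps its center."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:
            new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers


def criterion(points, labels, centers):
    """Sum over clusters of squared distances to the cluster center."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


def lloyd(points, centers, iterations=20):
    """Alternate assignment and averaging; neither step increases the criterion."""
    for _ in range(iterations):
        labels = assign(points, centers)
        centers = cluster_means(points, labels, centers)
    return assign(points, centers), centers


# Two well separated hypothetical clusters on the real line.
points = [(0.0,), (0.1,), (0.2,), (10.0,), (10.1,), (10.2,)]
labels, centers = lloyd(points, centers=[(0.0,), (1.0,)])
print(labels, centers, criterion(points, labels, centers))
```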

Gradient Descent Procedures: These procedures extremize some criterion function
in order to obtain the desired classification rule. For instance, feedforward
neural networks can implement general highly nonlinear mappings
from the feature space $X = \mathbb{R}^N$ to the output space, $Y$, which for classification
problems can be a finite set of integers whose elements represent
the possible classes.

The neural net classifier is represented by a family of parametric functions
$f_w(x)$, each indexed by a particular vector $w$ of weights. There may be
many neural nets, with different $w$, that yield optimal classifiers. For a
randomly drawn test vector $x \in X$, a criterion function $e(w)$ can be
defined to measure the expected difference between the classifier output
$f_w(x)$ and the correct classification $y$ of the vector $x$. Using a teaching set
of examples, $(x_i, y_i)$, $1 \le i \le m$, an algorithm such as backpropagation
(cf. Hinton et al. [4]) can be used to search for any $w$ that achieves a
global minimum of the criterion function by stepping in small increments
in the direction of steepest descent, i.e., along the negative gradient of $e(w)$.

One of the main difficulties of such algorithms is choosing the starting
point of the search. A bad starting point can yield a sequence of gradient
descent steps leading to a local minimum associated with a nonoptimal classifier.
Adding random noise to the learning rule may help increase the
chance of reaching a global minimum.
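
The stepping rule can be sketched on the smallest possible "network": a single sigmoid unit $f_w(x) = 1/(1 + e^{-(wx + b)})$ on scalar inputs, trained on an empirical squared-error criterion. The unit, the criterion, the step size and the four-example teaching set are all assumptions made for illustration; this is plain gradient descent, not a full backpropagation implementation.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def criterion(w, b, data):
    """Empirical squared error of the unit f(x) = sigmoid(w*x + b)."""
    return sum((sigmoid(w * x + b) - y) ** 2 for x, y in data) / len(data)


def gradient(w, b, data):
    """Partial derivatives of the criterion with respect to w and b."""
    gw = gb = 0.0
    for x, y in data:
        s = sigmoid(w * x + b)
        common = 2.0 * (s - y) * s * (1.0 - s) / len(data)
        gw += common * x
        gb += common
    return gw, gb


def gradient_descent(data, w=0.0, b=0.0, step=0.5, iterations=200):
    """Repeatedly take a small step against the gradient, starting from (w, b)."""
    for _ in range(iterations):
        gw, gb = gradient(w, b, data)
        w -= step * gw
        b -= step * gb
    return w, b


# Hypothetical teaching set: targets 0.0 and 1.0 stand for the two classes.
data = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
w, b = gradient_descent(data)
print(criterion(0.0, 0.0, data), criterion(w, b, data))
```

For this separable toy problem the criterion has no spurious local minima, so the choice of starting point matters much less than it does for multilayer networks.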

1.2  Labeled  and  Unlabeled  Examples 

From our vantage point, we emphasize that, as in all scientific research where rules are
to be discovered, observed data (or examples) are of primary necessity in all methods
regardless of the a priori partial information. There are two fundamental types of
examples, labeled and unlabeled. A labeled sample is a collection of $m$ pairs $(x_i, y_i)$,
$1 \le i \le m$, where $x_i$ is a feature vector and $y_i \in \{1, 2\}$ is its corresponding class label.
The class label $y_i$ is drawn at random according to the a priori probabilities $p_1$
and $p_2$; the feature vector $x_i$ corresponding to $y_i$ is then drawn at random according to
the class-conditional density $f_{y_i}(x)$. An unlabeled sample consists only of the feature
vectors $x_i$ drawn according to the mixture density $f(x) = p_1 f_1(x) + p_2 f_2(x)$.
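
The two sampling mechanisms can be sketched as follows, assuming unit-covariance Gaussian class conditional densities with hypothetical means. Drawing from the mixture density is the same as drawing a labeled pair and discarding the label.

```python
import random


def draw_labeled(m, p1, mean1, mean2, rng):
    """Draw m pairs (x, y): y is 1 with probability p1 and 2 otherwise, then x
    is drawn from a unit-covariance Gaussian centered at the mean of class y."""
    sample = []
    for _ in range(m):
        y = 1 if rng.random() < p1 else 2
        mean = mean1 if y == 1 else mean2
        x = [rng.gauss(mu, 1.0) for mu in mean]
        sample.append((x, y))
    return sample


def draw_unlabeled(n, p1, mean1, mean2, rng):
    """Draw n feature vectors from the mixture density by discarding labels."""
    return [x for x, _ in draw_labeled(n, p1, mean1, mean2, rng)]


rng = random.Random(1)  # fixed seed so the sketch is reproducible
labeled = draw_labeled(5, 0.5, [-1.0, 0.0], [1.0, 0.0], rng)
unlabeled = draw_unlabeled(50, 0.5, [-1.0, 0.0], [1.0, 0.0], rng)
print(len(labeled), len(unlabeled))
```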

Labeled  examples  clearly  contain  more  information  than  unlabeled  examples  and 
all  things  being  equal  would  be  the  preferred  form  of  data  for  the  learner.  However, 
as  T.  M.  Cover  [6]  indicates,  very  often  it  is  the  case  that  unlabeled  examples  are 
more  abundant  and  cheaper  to  acquire  than  labeled  examples  and  for  that  reason 
mixed-sample learning is intuitively attractive.

Consider again the problem of medical cancer diagnosis. Here it is necessary
to recognize malignant cells. Let us take as thesis that the process of generating
unidentified malignant/benign pictures of cells from both cancer patients and healthy
persons is substantially cheaper than having an expert determine whether a given cell
is malignant: say that one needs $100/picture for an expert to identify the cells but
only $10/picture for a technician who takes the pictures*. Ideally, one would want to
take say 100 pictures and have the expert label only 10 of these as being malignant
or not and then feed the whole information to a computer and, with some clever
algorithm, learn the classification to within a small error; this is preferred, costwise,
over taking 25 pictures and having an expert label all of them†. As it stands today, the
practitioner must resort to a variety of heuristics and knowledge from past experience
in order to decide how many labeled and unlabeled examples need to be procured to
obtain a good classification rule. As another example, consider the task of classifying
trees in a forest by their names. Unlabeled examples are free as there are practically
an endless number of trees that one can examine. The human expert charges by the
hour for supplying the names of trees.
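
The cost comparison in the cell-picture example can be checked with a few lines of arithmetic. The sketch assumes that the expert's fee is paid on top of the cost of taking each picture; the function name and parameters are hypothetical.

```python
def acquisition_cost(pictures, labeled, picture_cost=10, label_cost=100):
    """Total cost in dollars of taking `pictures` pictures and having an
    expert label `labeled` of them (the labeling fee is added on top)."""
    if labeled > pictures:
        raise ValueError("cannot label more pictures than were taken")
    return pictures * picture_cost + labeled * label_cost


mixed = acquisition_cost(pictures=100, labeled=10)
all_labeled = acquisition_cost(pictures=25, labeled=25)
print(mixed, all_labeled)  # prints: 2000 2750
```

Under this accounting the mixed sample of 100 pictures costs less than the fully labeled sample of 25, so it is preferable provided it supports comparable accuracy.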

Let us denote explicitly by $P_{\mathrm{error}}(m, n)$ the objective error (for a fixed algorithm)
given a labeled sample of size $m$ and an unlabeled sample of size $n$. Our interest here is
to present a theoretical analysis that provides an insight into the tradeoff between the
finite unlabeled and labeled sample sizes needed to learn (i.e., determine $P_{\mathrm{error}}(m, n)$).
The question may be succinctly put as follows: How many unlabeled examples is one
labeled example worth?

In this thesis we present an approach which answers this question for some classification
problems under different scenarios. Each scenario depends on the additional
side information that is given to the learner, for instance the parametric form of the
underlying mixture density, and also on the particular algorithm which approaches
the estimation of the decision rule by a specific technique.

* These figures may improve pending the fate of the new health care plan!
† Of course, what one really wants to do is to minimize the Bayes risk for appropriate loss functions.

1.3  Organization 

In Chapter 2 we present some additional motivation for being interested in the tradeoff
between the number of labeled and unlabeled examples for learning a classification
rule. We refer to several established results that touch upon this area in the limiting
sense: with infinitely many unlabeled examples and just one labeled example a decision
rule having $P_{\mathrm{error}} = 2 P_{\mathrm{Bayes}}(1 - P_{\mathrm{Bayes}})$ can be achieved. Roughly speaking this
means that the first labeled example contains one half of the classifying information
(cf. Cover & Castelli [5]). Still with an infinity of unlabeled examples, as we increase
the number, $m$, of labeled examples the classification error goes arbitrarily close to
$P_{\mathrm{Bayes}}$, exponentially fast in $m$. The question remains as to how fast the
error decreases with respect to increasing the unlabeled sample size. In Chapters 4, 5, and
6 we determine the rates under different scenarios. We then describe our approach
for learning with a mixed sample, which follows the Probably Approximately Correct
(PAC) model of learning with examples. With the technical background on which
PAC is based we can analyze learning with a mixed sample under different scenarios
of side information given to the learner. We explore two such scenarios: a parametric
one, based on the MLE principle, and a nonparametric one, based on Kernel Density
Estimation.

In Chapter 3 we present the necessary technical background in the form of established
theorems, definitions and examples. The main technical results needed are
various uniform strong laws of large numbers over classes of functions which arise
from the pioneering work of V. N. Vapnik and A. Ya. Chervonenkis. The results
themselves constitute powerful generalizations of the earliest and best known uniform
strong law, the Glivenko-Cantelli theorem. The basic form of the uniform
strong law that will be invoked asserts that, for various general function classes, the
averages of functions (evaluated from a random sample) converge uniformly and exponentially
fast to the corresponding expectations, the rates being governed by various
covering numbers and a combinatorial parameter known as the Vapnik-Chervonenkis
dimension.
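
The Glivenko-Cantelli theorem is the special case in which the function class consists of indicators of half-lines, so the averages are values of the empirical distribution function. As a rough numerical illustration, assuming a sample from the uniform distribution on [0, 1] (a choice made here only because its distribution function is $F(t) = t$), the uniform deviation $\sup_t |F_n(t) - F(t)|$ can be computed exactly from the sorted sample.

```python
import random


def sup_deviation(sample):
    """sup over t of |F_n(t) - F(t)| for a sample from the uniform law on [0, 1],
    where F(t) = t and F_n is the empirical distribution function."""
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for k, x in enumerate(xs):
        # F_n jumps from k/n to (k+1)/n at x; check both sides of the jump.
        worst = max(worst, abs((k + 1) / n - x), abs(k / n - x))
    return worst


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for n in (100, 1000, 10000):
    print(n, sup_deviation([rng.random() for _ in range(n)]))
```

The printed deviations shrink as the sample grows, which is the uniform convergence that the more general laws extend to richer function classes.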

The main contributions of the thesis are contained in Chapters 4, 5 and 6. In
order to compare the tradeoff between labeled and unlabeled examples under the parametric
and nonparametric scenarios, we focus on learning the same problem, namely
two N-dimensional Gaussian distributed pattern classes, but with the learner having
different amounts of side information. Our overall approach is to obtain sample
complexities, i.e., the sufficient number $m$ of labeled examples and the number $n$ of
unlabeled examples for learning to classify under a prespecified accuracy $\epsilon$ and confidence
$1 - \delta$. With the theory of Chapter 3 we obtain finite bounds on the sample
complexities which enable us to quantify the tradeoff between $m$ and $n$.

In Chapter 4 we investigate the parametric scenario where the learner knows
the form of the underlying probability densities. We present theorems which state
the finite sample complexities for two parametric algorithms, E and M. Algorithm E
utilizes a purely labeled sample of size $m$ which is polynomial in $1/\epsilon$ and in the dimensionality
$N$. Algorithm M, which is based on maximizing the likelihood function,
utilizes a mixed sample. The unlabeled examples are used for estimating the decision
border, and the labeled examples are used to determine the optimal labeling of the
partition. The proof of the sample complexities for algorithm M is intricate. We provide
a preview of the proof and explain the basic concepts underlying it. As expected,
algorithm M requires fewer labeled examples than algorithm E on account of using
the unlabeled sample. We take the ratio of the number of unlabeled examples over
this reduction as a representative of the value of one labeled example when learning
under this parametric scenario. This is shown (in Chapter 6) to be polynomial in
$N$, $1/\epsilon$, and $\log(1/\delta)$. We then investigate the sample complexities under the parametric
scenarios of equal and different a priori class probabilities.

Chapter 5 focuses on the nonparametric scenario. The learner has no knowledge
about the form of the underlying class densities and resorts to extracting all of the
information solely from the $n$ unlabeled examples and $m$ labeled examples. Our
approach is to use the modes of the Gaussian mixture to determine the Bayes decision
border. Our algorithm utilizes Kernel Density Estimation to estimate the unknown
mixture density $f$ by $f_n(x)$. Using $f_n(x)$, it then constructs estimates of the modes of
$f$ which are shown to be consistent, whence the decision rule can have $P_{error}$ arbitrarily
close to $P_{Bayes}$. We provide a theorem which states the finite mixed sample complexity
for this algorithm. The proof again utilizes the theory of Chapter 3, where the uniform
strong law of large numbers plays a principal role in admitting a measure of complexity
for this nonparametric approach. We then use the finite sample complexities to
establish the tradeoff between $m$ and $n$. This is shown (in Chapter 6) to be exponential
in $N$ and $1/\epsilon$. Requiring no parametric information suggests that the algorithm can be
applied to other, non-Gaussian, problems where the Bayes decision border can be
identified via the modes of the class density mixture. We describe the family of
problems for which algorithm K can be used and show that the decision rule is still
close to optimal when the sample complexities are as for the Gaussian problem.

Also in Chapter 5, we discuss two other nonparametric approaches to learning
classification: the Kohonen LVQ neural network and the $k$-means procedure. The
Kohonen LVQ neural network utilizes primarily the unlabeled sample to adapt a
fixed number of vectors, called neurons, according to a sequential learning rule which
performs gradient descent on the mean square error. In different fields,
e.g., vector quantization, pattern classification, and speech recognition, practitioners
have reported successful results using this type of learning rule. In our experimental
investigation we focus on the tradeoff between the number of labeled and unlabeled
examples that are necessary to achieve a specified classification accuracy. We consider
experiments that aim to reduce the labeled sample complexity, $m$, by adding a second
layer of neurons. This may be useful in situations where a labeled example is costly
and unlabeled examples are abundant. We report on the family of classification
problems for which a significant reduction in $m$ w.r.t. the dimensionality $N$ is evident.

The ad hoc clustering procedure known as the $k$-means method is based on a
Voronoi partition with center vectors $y_i$ that adapt so as to minimize the empirical
mean square error (MSE) based on the randomly drawn unlabeled sample. The true
MSE is defined as $E \min_{1 \le j \le k} |x - y_j|^2$, which measures the discrepancy between the
input $x$ and the output $y$, where $y$ is one of the $k$ vectors $y_1, \ldots, y_k$. In some problems
the minimum MSE partitions achieve classification rates that are optimal or close
to optimal when labeled correctly. We consider such a learning problem and, using
the uniform convergence laws of Chapter 3, we obtain the sufficient mixed sample
complexities.
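
To make the empirical MSE criterion concrete, the following minimal sketch evaluates $\frac{1}{n}\sum_i \min_j |x_i - y_j|^2$ on an unlabeled sample and performs one center update. The update shown is the standard Lloyd iteration, which is an assumption here, since the text only says that the centers adapt to minimize the empirical MSE; the function names `sq_dist`, `empirical_mse` and `lloyd_step` are hypothetical.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance |x - y|^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def empirical_mse(sample, centers):
    """(1/n) * sum_i min_j |x_i - y_j|^2 over the unlabeled sample."""
    return sum(min(sq_dist(x, y) for y in centers) for x in sample) / len(sample)


def lloyd_step(sample, centers):
    """One update (assumed Lloyd iteration): assign each x to its Voronoi
    cell, then move each center to the mean of its cell."""
    cells = [[] for _ in centers]
    for x in sample:
        j = min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
        cells[j].append(x)
    new_centers = []
    for cell, old in zip(cells, centers):
        if cell:
            new_centers.append([sum(col) / len(cell) for col in zip(*cell)])
        else:
            new_centers.append(list(old))  # an empty cell keeps its center
    return new_centers


# Illustrative data: two clusters in the plane, k = 2 centers.
rng = random.Random(0)
sample = [[rng.gauss(c, 1.0), rng.gauss(0.0, 1.0)]
          for c in (-2.0, 2.0) for _ in range(500)]
centers = [[0.0, 1.0], [0.5, -1.0]]
for _ in range(5):
    print(empirical_mse(sample, centers))
    centers = lloyd_step(sample, centers)
```

Each update of this kind cannot increase the empirical MSE, which is why the printed values are non-increasing.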

In Chapter 6 we accumulate the results of previous chapters and report the
tradeoff between the unlabeled and labeled examples for learning a classification rule under
the different scenarios. We discuss another possible approach based on algorithm K
of Chapter 5 which uses fewer unlabeled examples by estimating the mixture density
$f$ only at a finite number of points. Here, however, the learner needs more side
information than in algorithm K, as knowledge of a compact region which contains
the modes of $f$ is a necessary condition. We then discuss our ongoing work and a
conjecture about the sample complexity for the Gaussian mixture problem with
non-unit covariances. We also briefly discuss several extensions to learning with different
types of examples and issues relating to side information.


The following table lists the notation that will be used.

Symbol                                 Meaning
$E$                                    expectation
$P$                                    probability measure
$\epsilon$                             positive accuracy parameter
$1 - \delta$                           positive confidence parameter
$\theta_0$                             the true unknown parameter vector
$B(\theta_0, \epsilon)$                a ball of radius $\epsilon$ centered at $\theta_0$
$\partial B(\theta_0, \epsilon)$       the surface of the ball of radius $\epsilon$ centered at $\theta_0$
[illegible]                            likelihood function based on the sample $(x_1, \ldots, x_n)$ evaluated at $\theta$
$\theta_\epsilon$                      point on the surface of the ball $B(\theta_0, \epsilon)$
$|x| = \sqrt{\sum_{i=1}^{N} x_i^2}$    Euclidean norm of $N$-dimensional vector $x = [x_1, x_2, \ldots, x_N]$
$VC(\mathcal{H})$                      VC-dimension of class $\mathcal{H}$
$d$                                    VC-dimension
$N$                                    dimension of example-space
$n$                                    number of unlabeled examples
$m$                                    number of labeled examples
[illegible]                            covering number for class $\mathcal{H}$ under $L^1$-norm with probability measure $Q$
$\Theta$                               parameter space (compact set in Euclidean space)
$\mathrm{poly}_r(x)$                   $r$th-degree polynomial in $x$
$1_A$                                  indicator function of the set $A$
$c_1, c_2, c_3, \ldots$                positive constants

Table 1.1: Notations

Chapter 2
The  Problem 


In the preceding section we argued that a mixed-sample approach to learning a
classification rule has high appeal in practice. It is of interest to try to quantitatively
describe the tradeoff in unlabeled versus labeled sample sizes required for
learnability. To further our intuition, let us deduce some simple limiting results when either
the number of labeled examples $m \to \infty$ or when the number of unlabeled examples
$n \to \infty$.

Begin  with  the  following  observation:  With  an  unlimited  supply  of  independently 
drawn  examples  we  can  estimate  any  probability  distribution  function  arbitrarily  well. 
This  is  based  on  an  extension  of  the  Glivenko-Cantelli  theorem,  the  oldest  and  best 
known  uniform  strong  law  of  large  numbers  (cf.  Pollard  [21]). 

Consider first the case $m = \infty$, $n = 0$, where there are an infinity of labeled
examples and no unlabeled examples. An appeal to the Glivenko-Cantelli theorem
shows that we can obtain exact estimates of each of the pattern-class probability
distributions, and hence deduce the optimal Bayes rule. Consequently

\[ P_{error}(\infty, 0) = P_{Bayes}. \]

Unlabeled examples alone are not sufficient to learn a decision rule; with infinitely
many unlabeled examples the mixture density $f = p_1 f_1 + p_2 f_2$ can be learned exactly
(via the Glivenko-Cantelli theorem), but even if the decision border can be uniquely
identified from the mixture we still need labeled examples to associate every decision
region with one of the two classes. Indeed, suppose there are no labeled examples,
$m = 0$, and an infinity of unlabeled examples, $n = \infty$, and suppose the best case
where the class-conditional densities $f_1, f_2$ and the priors $p_1, p_2$ can be extracted from
the mixture (this notion of identifiability can be made mathematically precise). The
Bayes decision border can thus be procured but we are still left with the problem
of deciding the label “1” or “2” for the two regions $R_1$, $R_2$. With no recourse but
randomly guessing one of the two labelings we hence obtain

\[ P_{error}(0, \infty) = \frac{1}{2}\left(1 - P_{Bayes} + P_{Bayes}\right) = \frac{1}{2}, \]

a result no better than randomly guessing the label of an $x$ in the first place!
Nevertheless, it is clear that unlabeled examples do carry information, and with a small
amount of additional information in the form of labeled examples we should be able
to exploit this untapped source of information as we see in the sequel.

Let us now restrict ourselves to an identifiable mixture distribution $f(x)$ (cf.
Teicher [29]). In particular, if $f(x) = \pi g(x) + (1 - \pi) h(x)$, we can identify the exact
value of $\pi$ and the form of $g(x)$ and $h(x)$ given $f(x)$. Note, however, that it is not
known which of $g(x)$ and $h(x)$ is $f_1(x)$ (the other will be $f_2(x)$) and whether $\pi = p_1$
or $\pi = p_2$. Nevertheless, given an infinity of unlabeled examples, the Bayes decision
border $\{x : p_1 f_1(x) = p_2 f_2(x)\} = \{x : \pi g(x) = (1 - \pi) h(x)\}$ can be identified. This is
because the identifiable mixture $f(x)$ is obtained via the Glivenko-Cantelli theorem
and the Bayes border is invariant to the labeling of $g$ and $h$. As before, this decision
border optimally partitions the feature space into two disjoint regions $R_1$ and $R_2$ and
the difficulty is that we do not know which region should be labeled “1” and which
“2”.

Now suppose we have one labeled example $(x, y)$, i.e., $m = 1$. Denote by $E$ the
event that a random $x$ drawn according to $f(x)$ is misclassified. Draw one labeled
example by first choosing a class according to the a priori probabilities $p_1, p_2$ and then
drawing an $x$ according to the density of the chosen class (which is either $f_1(x)$ or
$f_2(x)$). Then label as class $y$ the region that contains $x$, and the second region by the
complement label. Clearly, there are two possible labelings: one has $R_1$ labeled “1”,
$R_2$ labeled “2”; this corresponds to the Bayes optimal labeling (denote it as $L_{good}$)
which has (conditional) error probability $P_{error} = P(E | L_{good}) = P_{Bayes}$; the other
labeling has $R_1$ labeled “2”, and $R_2$ labeled “1” (denoted by $L_{bad}$) and its conditional
error probability is given by $P_{error} = P(E | L_{bad}) = 1 - P_{Bayes}$. Either of these two
might be chosen. Consequently, the unconditional error probability of our decision
rule is given by

\[
\begin{aligned}
P_{error}(1, \infty) = P(E) &= P(E | L_{bad}) P(L_{bad}) + P(E | L_{good}) P(L_{good}) \\
&= (1 - P_{Bayes}) P(L_{bad}) + P_{Bayes} P(L_{good}).
\end{aligned}
\]

We have

\[
\begin{aligned}
P(L_{bad}) = {} & P(x \text{ has true label “1” and } x \text{ fell in } R_2) \\
& + P(x \text{ has true label “2” and } x \text{ fell in } R_1).
\end{aligned}
\]

This equals

\[ p_1 \int_{R_2} f_1(x)\,dx + p_2 \int_{R_1} f_2(x)\,dx = P_{Bayes}. \]

Clearly $P(L_{good}) = 1 - P_{Bayes}$, hence the total misclassification probability $P(E)$ of
the resulting classifier is

\[ P_{error}(1, \infty) = 2 P_{Bayes}(1 - P_{Bayes}) \le 2 P_{Bayes}. \]

Therefore for any problem with an identifiable class mixture and for any algorithm
that produces a decision rule utilizing $n = \infty$ unlabeled examples and $m = 1$ labeled
example, the classification performance is no worse than twice the best achievable
error performance! This result was demonstrated by T. M. Cover and V. Castelli
[5], who also considered the case when there is more than one labeled example. We
tackle this next.
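
The one-labeled-example result can be checked numerically. The sketch below is only an illustration under assumed settings that do not come from the thesis: a one-dimensional problem with classes $N(-\mu, 1)$ and $N(+\mu, 1)$ and equal priors, so that the Bayes border is $x = 0$ and $P_{Bayes} = \Phi(-\mu)$. The names `phi`, `one_label_error` and `simulate_one_label` are hypothetical.

```python
import math
import random


def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def one_label_error(p_bayes):
    """Closed form P_error(1, inf) = 2 * P_Bayes * (1 - P_Bayes)."""
    return 2.0 * p_bayes * (1.0 - p_bayes)


def simulate_one_label(mu=1.0, trials=100_000, seed=0):
    """Monte Carlo estimate of P_error(1, inf).

    Assumed setting: classes N(-mu, 1) (label 1) and N(+mu, 1) (label 2),
    equal priors, regions R1 = {x < 0} and R2 = {x >= 0}.
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        # Draw the single labeled example (x, y).
        y = 1 if rng.random() < 0.5 else 2
        x = rng.gauss(-mu if y == 1 else mu, 1.0)
        # The region containing x gets label y, the other the complement.
        label_r1 = y if x < 0 else 3 - y
        # Classify an independent test point with the induced labeling.
        y_test = 1 if rng.random() < 0.5 else 2
        x_test = rng.gauss(-mu if y_test == 1 else mu, 1.0)
        predicted = label_r1 if x_test < 0 else 3 - label_r1
        errors += predicted != y_test
    return errors / trials


p_bayes = phi(-1.0)  # Bayes error for means -1 and +1
print(p_bayes, one_label_error(p_bayes), simulate_one_label())
```

Under these assumptions the simulated error should agree with $2P_{Bayes}(1 - P_{Bayes})$ up to Monte Carlo noise.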

With a few more labeled examples we can rapidly get as close as desired to Bayes
performance. Suppose we have $m$ labeled examples and $n = \infty$ unlabeled examples.
Use the infinity of unlabeled examples to deduce the Bayes border $\{x : \pi g(x) =
(1 - \pi) h(x)\}$. Now w.l.o.g. suppose $R_1 = \{x : \pi g(x) > (1 - \pi) h(x)\}$, $R_2$ is the
complement of $R_1$, and $h(x) = f_2(x)$, $g(x) = f_1(x)$. We can determine exactly the
quantities $\eta_1 = P(2 | x \in R_1)$, $\eta_2 = P(1 | x \in R_2)$, and $p = \int_{R_1} f(x)\,dx$. The quantities
$\eta_1$ and $\eta_2$ are the probabilities that a randomly drawn test example $x$ is misclassified
given it is in $R_1$ or $R_2$, respectively. Also, $p = P(R_1)$ and $1 - p = P(R_2)$. The
procedure for labeling the regions is as follows: draw

\[ m = \frac{1}{p_{min}\left(1 - 2\sqrt{\eta_{max}(1 - \eta_{max})}\right)} \log\frac{1}{\delta} \]

labeled examples, where $p_{min} = \min(p, 1 - p)$ and $\eta_{max} = \max(\eta_1, \eta_2)$ and $\delta > 0$ is
arbitrarily small. Assign to each region $R_1$ and $R_2$ the label of the majority of the
examples that fell in it. If no examples fell in $R_i$ then label it “1” with probability $\frac{1}{2}$
and “2” with probability $\frac{1}{2}$. Then the resulting classifier has error probability

\[ P_{error}(m, \infty) \le P_{Bayes}(1 - 2\delta) + 4\delta. \]
(Cover & Castelli [5] have shown a similar bound.) We now briefly prove this result.
Let $E$ denote the event that a random $x$ is misclassified. There are four possible
labelings of the regions $R_1$, $R_2$ based on the labeled examples: $L_{good}$ has $R_1$ labeled
“1” and $R_2$ labeled “2”; $L_{bad,1}$ has $R_1$ labeled “2” and $R_2$ labeled “1”; $L_{bad,2}$ has $R_1$
and $R_2$ labeled “1”; $L_{bad,3}$ has $R_1$ and $R_2$ labeled “2”. We have

\[
\begin{aligned}
P(E) = {} & P(E | L_{good}) P(L_{good}) + P(E | L_{bad,1}) P(L_{bad,1}) + P(E | L_{bad,2}) P(L_{bad,2}) \\
& + P(E | L_{bad,3}) P(L_{bad,3}).
\end{aligned}
\]

We have $P(E | L_{good}) = P_{Bayes}$ and clearly $P(L_{good}) \le 1$. Also, $P(E | L_{bad,1}) = 1 - P_{Bayes}$
(see earlier), while $P(E | L_{bad,2}) = p_2$ and $P(E | L_{bad,3}) = p_1$.

Now, $P(L_{bad,2})$ equals the probability that the majority of examples in $R_1$ are
“1” and that the majority of examples in $R_2$ are “1”, or that the majority of examples
in $R_1$ are “1” and none fell in $R_2$ and “1” was chosen, or that the majority in $R_2$ are
“1” and none fell in $R_1$ and “1” was chosen. Similarly $P(L_{bad,3})$ equals the probability
that the majority of examples in $R_1$ are “2” and that the majority of examples in
$R_2$ are “2”, or that the majority of examples in $R_1$ are “2” and none fell in $R_2$ and
“2” was chosen, or that the majority in $R_2$ are “2” and none fell in $R_1$ and “2” was
chosen. The probability that the majority of examples in $R_2$ are “1” is given by

\[ \sum_{k=1}^{m} \binom{m}{k} (1 - p)^k p^{m-k} \sum_{j=k/2}^{k} \binom{k}{j} \eta_2^j (1 - \eta_2)^{k-j}. \]

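Using Chernoff's bound for a binomially distributed random variable we can bound the
inner sum by $(4\eta_2(1 - \eta_2))^{k/2}$, whence we obtain the upper bound

\[ e^{-m(1-p)\left(1 - 2\sqrt{\eta_2(1 - \eta_2)}\right)} \le e^{-m p_{min}\left(1 - 2\sqrt{\eta_{max}(1 - \eta_{max})}\right)}. \]

Similarly, the probability that the majority of examples in $R_1$ are “2” is given by

\[ \sum_{k=1}^{m} \binom{m}{k} p^k (1 - p)^{m-k} \sum_{j=k/2}^{k} \binom{k}{j} \eta_1^j (1 - \eta_1)^{k-j} \le e^{-m p_{min}\left(1 - 2\sqrt{\eta_{max}(1 - \eta_{max})}\right)}. \]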


We also have that the probability that the majority in $R_1$ are “1” and the majority in $R_2$
are “1” is less than the probability that the majority of examples in $R_2$ are “1”.
Also, the probability that the majority in $R_1$ are “2” and the majority in $R_2$ are “2”
is less than the probability that the majority in $R_1$ are “2”. Similarly, the probability
that the majority in $R_1$ are “1” and none fell in $R_2$ and “1” was chosen is less than
the probability that none fell in $R_2$ and “1” was chosen, which is bounded above by

\[ \frac{1}{2} p^m \le \frac{1}{2} e^{-m(1-p)} \le \frac{1}{2} e^{-m p_{min}} \le \frac{1}{2} e^{-m p_{min}\left[1 - 2\sqrt{\eta_{max}(1 - \eta_{max})}\right]}, \]

the last inequality following since $1 - 2\sqrt{\eta_{max}(1 - \eta_{max})} \le 1$. Define

\[ \delta = e^{-m p_{min}\left[1 - 2\sqrt{\eta_{max}(1 - \eta_{max})}\right]}. \]
Using the above we have

\[ P(L_{bad,2}) \le \delta + \frac{1}{2}\delta + \frac{1}{2}\delta = 2\delta, \quad \text{and} \quad P(L_{bad,3}) \le 2\delta. \]

We can similarly obtain that

\[ P(L_{bad,1}) \le 2\delta. \]

Thus we have

\[
\begin{aligned}
P(E) &\le P_{Bayes} \cdot 1 + p_2 \, 2\delta + p_1 \, 2\delta + (1 - P_{Bayes}) \, 2\delta \\
&= P_{Bayes}(1 - 2\delta) + 4\delta.
\end{aligned}
\]

The left side is by definition $P_{error}(m, \infty)$. This concludes the proof. As $\delta > 0$
can be arbitrarily small, we conclude that given an infinity of unlabeled examples,
as the labeled sample size $m$ increases, the classifier performance approaches $P_{Bayes}$
exponentially fast in $m$.
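
The labeled sample size and the resulting bound can be evaluated directly. The sketch below is an illustration under assumptions that are not in the thesis: the same one-dimensional setting with classes $N(-\mu, 1)$ and $N(+\mu, 1)$ and equal priors (so $p = \frac{1}{2}$ and $\eta_1 = \eta_2 = \Phi(-\mu)$), rounding $m$ up to an integer, and breaking exact ties in a region by a fair coin. All function names are hypothetical.

```python
import math
import random


def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rate(p, eta1, eta2):
    """Exponent p_min * (1 - 2*sqrt(eta_max*(1 - eta_max))) from the text."""
    p_min = min(p, 1.0 - p)
    eta_max = max(eta1, eta2)
    return p_min * (1.0 - 2.0 * math.sqrt(eta_max * (1.0 - eta_max)))


def labeled_sample_size(p, eta1, eta2, delta):
    """Smallest integer m with exp(-m * rate) <= delta."""
    return math.ceil(math.log(1.0 / delta) / rate(p, eta1, eta2))


def error_bound(p_bayes, delta):
    """Upper bound P_Bayes*(1 - 2*delta) + 4*delta on P_error(m, inf)."""
    return p_bayes * (1.0 - 2.0 * delta) + 4.0 * delta


def majority_label(votes_1, votes_2, rng):
    """Majority label of a region; ties and empty regions use a fair coin."""
    if votes_1 == votes_2:
        return 1 if rng.random() < 0.5 else 2
    return 1 if votes_1 > votes_2 else 2


def simulate(m, mu=1.0, trials=5_000, seed=0):
    """Average error of majority-vote region labeling with m labeled examples.

    Assumed setting: classes N(-mu, 1) (label 1) and N(+mu, 1) (label 2),
    equal priors, R1 = {x < 0}, R2 = {x >= 0}.
    """
    rng = random.Random(seed)
    # Joint mass of (minority class, region) and (majority class, region).
    wrong, right = 0.5 * phi(-mu), 0.5 * phi(mu)
    total = 0.0
    for _ in range(trials):
        votes = {(r, c): 0 for r in (1, 2) for c in (1, 2)}
        for _ in range(m):
            c = 1 if rng.random() < 0.5 else 2
            x = rng.gauss(-mu if c == 1 else mu, 1.0)
            votes[(1 if x < 0 else 2, c)] += 1
        a = majority_label(votes[(1, 1)], votes[(1, 2)], rng)  # label of R1
        b = majority_label(votes[(2, 1)], votes[(2, 2)], rng)  # label of R2
        # Exact error of the induced rule: mass of the class not chosen.
        total += (wrong if a == 1 else right) + (wrong if b == 2 else right)
    return total / trials


eta = phi(-1.0)
delta = 0.01
m = labeled_sample_size(0.5, eta, eta, delta)
print(m, simulate(m), error_bound(eta, delta))
```

In this assumed setting a few dozen labeled examples already suffice for $\delta = 0.01$, and the simulated error sits between $P_{Bayes}$ and the bound.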

Related to the limiting case of $P_{error}(\infty, 0)$ is a classical result of Cover & Hart
[33] pertaining to the nearest neighbor (NN) classification rule (a nonparametric
method). This classifier is based on a Voronoi partition of the feature space, each
Voronoi cell placed around one labeled example. They bound the error of the NN
classifier in the infinite labeled sample limit as

\[ P_{Bayes} \le P_{error,NN}(\infty, 0) \le 2 P_{Bayes}(1 - P_{Bayes}). \]

An improvement towards a more realistic case (having finite sample size instead of
infinite) has been recently achieved by Psaltis, Snapp & Venkatesh [15], who showed
that

\[ P_{error,NN}(m, 0) = P_{Bayes} + O(m^{-2/N}) \]

for the nearest neighbor classifier in $N$-dimensional feature space.

Finally, for mixed-sample learning, the case of $P_{error}(m, n)$ is the most realistic
since both sample sizes are finite. The form of the solution depends strongly on
the algorithm used to learn the classification rule. Different methods may utilize
different assumptions, or side information, regarding the class distributions and there
may be various ways of learning from a mixture of labeled and unlabeled examples.
For instance the learning of the decision border may be done solely with unlabeled
examples while leaving the labeled sample only for labeling the regions. Another
approach would use both labeled and unlabeled examples to learn the border. Our
effort in this thesis is dedicated to tackling the $P_{error}(m, n)$ case. In this context, Cover
& Castelli have suggested that for identifiable families, $P_{error}(m, n)$ may take the
form $O(\frac{1}{n}) + O(e^{-\alpha m})$, i.e., that there is an exponential tradeoff between labeled and
unlabeled examples.

Analyzing the size of finite labeled samples as the basis of learning complexity is
the approach taken in the PAC (probably approximately correct) model of learning
theory (Valiant [34], Blumer, Ehrenfeucht, Haussler & Warmuth [13], [14], Haussler
[12]) and also in the analysis of the nearest-neighbor classifier of Psaltis, Snapp &
Venkatesh [15]. (In contrast, another approach for representing learning complexity
is to obtain asymptotic infinite-sample limits as, for instance, in Cover & Hart [33].) In these
approaches, which utilize only labeled examples, the finite sample size may depend on
a prespecified required accuracy, the probabilistic confidence of the result, the dimensionality
of the feature space and possibly more given parameters.




The  theory  on  which  PAC  learning  is  based  can  be  used  to  analyze  learning  with 
a  mixed  sample,  i.e.,  with  both  unlabeled  and  labeled  examples,  and  obtain  finite 
bounds  on  m  and  n.  These  estimates  can  then  be  used  as  a  measure  of  quantifying 
the  tradeoff  in  unlabeled  versus  labeled  examples  in  learning  a  classification  decision 
rule.  However,  as  was  discussed  in  the  previous  section,  there  are  really  three  forces 
at  play  here:  the  number  of  labeled  examples  m,  the  number  of  unlabeled  examples 
n,  and  the  amount  of  side  information  given  a  priori  to  the  learner  (for  instance,  in 
terms  of  assumptions  on  the  class  conditional  densities).  To  see  the  tradeoff  between 
any  two  of  these  three  variables  we  need  to  fix  the  third.  It  is  not  clear  how  to 
quantify  side  information;  there  are  still  open  issues  to  tackle  here.  Our  approach 
will  be  to  compare  the  tradeoff  between  labeled  and  unlabeled  sample  sizes  under 
several  (qualitative)  scenarios  of  side  information  available  to  the  learner. 

Finite sample complexity results are more difficult to derive than asymptotic
results, and typically require a case-by-case analysis; a fully general theory is still
in abeyance. In this thesis we primarily investigate two scenarios: (1) the tradeoff
between $m$ and $n$, conditioned on the knowledge of the parametric form of the
class-conditional densities; the analysis and results are presented for the specific parametric
case of a multi-dimensional Gaussian mixture, though the technique extends to other
parametric families; (2) the tradeoff between $m$ and $n$ conditioned on the knowledge
that the modes of the mixture determine the Bayes optimal decision border (neither
the parametric form of the mixture nor information about whether it is identifiable
is given to the learner). The function classes considered in the latter case are
potentially much larger than in the former parametric case.

We approached scenario (1) by choosing two parametric estimation methods,
moment estimation and maximum likelihood parameter estimation (MLE). The former
is easily applied to the case of a purely labeled sample since it involves estimating
independently the moments of the two class conditional densities; the latter method
can utilize unlabeled examples to estimate the mixture's parameters and labeled
examples to choose the good labeling. Chapter 4 presents the analysis and results for
this scenario. In scenario (2), we compared learning with only labeled examples by
the parametric moment-estimation method to learning with a mixed sample with the
Kernel Density Estimation method. The kernel based method invoked here can utilize
unlabeled examples and requires no knowledge about either the form of the class
mixture or whether it is identifiable, but does utilize prior knowledge that the modes
of the mixture $f(x)$ determine the Bayes border for the class under investigation.
Results are presented in Chapter 5.

In  both  scenarios,  the  sample  size  of  the  purely  labeled  parametric  approach 
represents  a  lower  limit  on  the  necessary  number  of  labeled  examples  for  learning 
classification  when  unlabeled  examples  are  unavailable  since  the  sufficient  statistics 
are  accessible  and  hence  the  method  is  efficient.  One  should  expect  that  unlabeled 
examples  are  worth  something  and  hence  anticipate  a  reduction  in  the  labeled  sample 
size  when  learning  with  a  mixed  sample  approach  compared  to  the  purely-labeled 
approach. As we will see, the relative amounts of side information available to the
learner in the two scenarios determine the tradeoff.

One common denominator between the MLE and the kernel estimation technique
used here is that they both can be analyzed using the technical machinery of
the uniform SLLN (reviewed in Chapter 3). This is also the fundamental principle
behind the main branch of the field of computational learning theory. Using this
theory, the complexity, or cost, of learning general abstract problems, and also practical
problems such as classification and regression, can be expressed quantitatively. A primary
measure of cost is the number of examples that are sufficient (or even necessary) to
learn the problem to within a prespecified accuracy and confidence. In subsequent
chapters we base all of our algorithmic complexity measures on the sufficient sample
sizes for learning a classification problem.

To compare the tradeoff in both scenarios, we restrict our discussion largely
to learning one common classification problem, in which the two classes are distributed
according to

\[ f_i(x) = \frac{1}{(2\pi)^{N/2}} e^{-\frac{1}{2}|x - \mu_i|^2} \]

for $i = 1, 2$, and $x \in \mathbb{R}^N$, $|x|^2 = x_1^2 + \ldots + x_N^2$, and the classes have a priori
probabilities $p_1$, $p_2$. In scenario (1) this problem belongs to a parametric family of
classification problems since the form of the mixture is known to the learner. In
scenario (2) this same problem belongs to a family of nonparametric classification
problems where the learner knows very little about this mixture. The tradeoff between
unlabeled and labeled examples is significantly different in the two scenarios, as will be
shown in subsequent chapters.
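
For this unit-covariance Gaussian problem the Bayes rule and the Bayes error have simple closed forms, which the following sketch evaluates and checks by sampling. The closed form used for $P_{Bayes}$ is the standard one for two Gaussians with identity covariance, stated here as an aid rather than quoted from the thesis; the function names and the example means and priors are hypothetical.

```python
import math
import random


def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bayes_error(mu1, mu2, p1):
    """Bayes error for N(mu1, I) vs N(mu2, I) with priors p1 and 1 - p1."""
    dist = math.dist(mu1, mu2)
    shift = math.log(p1 / (1.0 - p1)) / dist
    return p1 * phi(-dist / 2.0 - shift) + (1.0 - p1) * phi(-dist / 2.0 + shift)


def bayes_rule(x, mu1, mu2, p1):
    """Return 1 if p1*f1(x) >= p2*f2(x), else 2.

    The common factor (2*pi)^(-N/2) cancels, so log scores suffice.
    """
    s1 = math.log(p1) - 0.5 * math.dist(x, mu1) ** 2
    s2 = math.log(1.0 - p1) - 0.5 * math.dist(x, mu2) ** 2
    return 1 if s1 >= s2 else 2


def sample(mu1, mu2, p1, rng):
    """Draw (x, label): pick a class by its prior, then x from its density."""
    label = 1 if rng.random() < p1 else 2
    mu = mu1 if label == 1 else mu2
    return [rng.gauss(m, 1.0) for m in mu], label


# Illustrative values: N = 3, means two units apart, unequal priors.
rng = random.Random(0)
mu1, mu2, p1 = [1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], 0.3
draws = [sample(mu1, mu2, p1, rng) for _ in range(50_000)]
empirical = sum(bayes_rule(x, mu1, mu2, p1) != y for x, y in draws) / len(draws)
print(bayes_error(mu1, mu2, p1), empirical)
```

The decision border here is a hyperplane orthogonal to $\mu_1 - \mu_2$, which is the border that the mixed-sample methods try to recover from unlabeled data.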

In  both  scenarios,  for  the  mixed-sample  methods,  we  used  unlabeled  examples 
to  estimate  the  mixture  density  (thereby  learn  the  decision  border)  while  labeled 
examples  were  used  solely  for  labeling  the  decision  regions.  Had  we  chosen  a  different 
approach  which  also  utilizes  the  labeled  sample  for  learning  the  decision  border,  we 
might  have  needed  fewer  unlabeled  examples.  So  in  this  respect,  our  results  give 
an  upper  bound  on  the  number  of  unlabeled  examples  required  in  a  trade  for  every 
labeled  example,  when  conditioned  on  fixed  side  information. 

For scenario (1) we also considered the case where the class a priori probabilities
are different and obtained the unlabeled versus labeled examples tradeoff. This is
presented in Chapter 4. In Chapter 5 we investigated two additional nonparametric
algorithms which use a mixed sample. We present computer simulations for an LVQ
neural network classifier, and theoretical analysis for a related algorithm called
$k$-means.

Before presenting our work, we first review several established theorems and
examples of the theory which we use in succeeding sections.

Chapter 3

Technical  Results 

3.1  The  Strong  Law  of  Large  Numbers 

In this chapter we present several technical results to be used in calculating the
finite values $n$ and $m$ which are sufficient for the two density estimation methods:
maximum likelihood estimation (MLE) and kernel density estimation. Both the MLE
and the kernel methods involve learning with randomly drawn unlabeled examples.
It is possible to represent each as a problem of approximating the expectations of
functions $h$ in some large class $\mathcal{H}$ by the empirical means $\frac{1}{n}\sum_{i=1}^{n} h(x_i)$, where
$x_1, \ldots, x_n$ denotes a random sample. We would like to assert that the empirical
means $\frac{1}{n}\sum_{i=1}^{n} h(x_i)$ converge uniformly over the class $\mathcal{H}$ to the corresponding
expectations $E h(x)$. This will be achieved by the principle of the uniform SLLN, which
is hence at the heart of both the MLE and kernel methods (as well as many other
methods).

The SLLN is one of the fundamental laws in statistics and it arises whenever
an empirical procedure which involves randomly drawn observations is believed to be
governed by the laws of probability. In practice, we can only run empirical methods,
e.g., MLE and kernel. The SLLN assists in bridging the gap between inference based
on empirical measurements and that based on probability. We now present some
fundamentals of the theory of uniform SLLN convergence for empirical means of functions
to their expectations (these results were pioneered by Vapnik & Chervonenkis in [16],
[17]).

The classical SLLN of Kolmogorov shows that we have arbitrarily small
deviations between the empirical and true mean of a function $h$, with probability 1.

Theorem 3.1 (SLLN) If $x_i$, $1 \le i \le n$, is an i.i.d. sequence of random variables
with finite expectation $E|x| < \infty$, then

\[ \frac{1}{n} \sum_{i=1}^{n} x_i \to E x \quad \text{a.e.} \]

Consequently for any measurable function $h(x)$ with $E|h(x)| < \infty$,

\[ \frac{1}{n} \sum_{i=1}^{n} h(x_i) \to E h(x) \quad \text{a.e.} \]

This guarantees a.e. convergence for any single function $h \in \mathcal{H}$ and consequently
jointly for any finite collection of functions. When the class $\mathcal{H}$ of functions of interest
is infinite, however, the strong law by itself will not suffice to guarantee uniform
convergence over the whole class. There is a stronger version of this notion, however,
called the uniform SLLN, which is the only convergence concept needed for our purposes
and can be found in Pollard [21]. For the MLE and kernel methods it is necessary in
Sections 4.3 and 5.2 to have convergence for a whole uncountable class $\mathcal{H}$ of functions.
In technical terms the uniform SLLN is expressed as

\[ \sup_{h \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^{n} h(x_i) - E h \right| \to 0 \quad \text{a.e.} \]
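
A familiar instance of uniform convergence is the Glivenko-Cantelli setting, where the class consists of the half-line indicators $h_t(x) = 1\{x \le t\}$. The sketch below computes the supremum deviation over this (infinite) class exactly for a sample; the choice of a standard normal distribution and the function names are assumptions made only for illustration.

```python
import math
import random


def phi(z):
    """Standard normal CDF, i.e. E h_t(x) for h_t(x) = 1{x <= t}."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sup_deviation(sample):
    """sup over t of |(1/n) * sum_i 1{x_i <= t} - Phi(t)|.

    The sample is assumed to come from N(0, 1). The supremum over all
    thresholds t is attained at a sample point, just before or just
    after the empirical distribution function jumps.
    """
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        f = phi(x)
        worst = max(worst, abs((i + 1) / n - f), abs(i / n - f))
    return worst


rng = random.Random(0)
for n in (100, 1_000, 10_000, 100_000):
    print(n, sup_deviation([rng.gauss(0.0, 1.0) for _ in range(n)]))
```

The printed deviations shrink as $n$ grows, even though the supremum is taken over uncountably many functions.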

Such uniform convergence cannot hold over all classes of functions; it is easy to
construct instances of classes for which such convergence fails. The whole game is
hence to identify classes $\mathcal{H}$ (as generally as possible) for which we can assert such
uniform convergence. To facilitate the understanding of how such uniform convergence
over a class of functions is proved and to introduce the basis for the finite-sample-size
results that will be exhibited subsequently, we choose to express this notion with the
following statement

\[ P\left( \sup_{h \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^{n} h(x_i) - E h(x) \right| > \epsilon \right) < \delta \qquad (3.1) \]

for any $\epsilon, \delta > 0$ and for all large enough $n$ (which may be a function of $\delta$ and $\epsilon$). (Note,
since we may write $\delta = \delta(n, \epsilon)$, if $\sum_{n=1}^{\infty} \delta(n, \epsilon)$ converges, then by the
Borel-Cantelli Lemma (cf. Chung [19]) the above definition of uniform a.e. convergence
will follow; in the theorems to follow, this will be the case because $\delta(n, \epsilon)$ decays
exponentially in $n$.)
If the class $\mathcal{H}$ is finite with cardinality $K$, then the regular SLLN can guarantee
this convergence for every $h \in \mathcal{H}$ since then the sup becomes a max; in particular, a
single application of Boole's inequality (the union bound) gives

\[ P\left( \max_{1 \le j \le K} \left| \frac{1}{n} \sum_{i=1}^{n} h_j(x_i) - E h_j(x) \right| > \epsilon \right) < K\delta \]

which can be made arbitrarily small since $\delta > 0$ is arbitrary. However, if the class
$\mathcal{H}$ has infinite cardinality (as is the case of the MLE and kernel methods), then the
regular SLLN in conjunction with Boole's inequality does not suffice to ensure uniform
convergence. It is necessary in such a case to approximate the class $\mathcal{H}$ by a finite
collection of functions $\{h_1, h_2, \ldots, h_{cov(\mathcal{H})}\}$, called a finite covering for $\mathcal{H}$, such that
every $h \in \mathcal{H}$ is “close” (in some metric sense) to at least one function $h_j$ in the
covering. The integer $cov(\mathcal{H})$ is the minimum cardinality of such a covering and is
called the covering number of $\mathcal{H}$. Then applying the SLLN uniformly over $h_j$, $1 \le
j \le cov(\mathcal{H})$ (together with some technical details), results in uniform convergence for
the whole class $\mathcal{H}$. In the MLE method we explicitly determine the covering number.
In some situations, such as in the kernel method, it is easier to use a bound for the
covering number. The covering number $cov(\mathcal{H})$ is bounded by a polynomial whose
degree is the celebrated quantity called the Vapnik-Chervonenkis (VC) dimension,
denoted as $VC(\mathcal{H})$. The VC-dimension $VC(\mathcal{H})$ influences the upper bound (3.1) on
the probability of $\epsilon$-deviation from the mean and hence is an important quantity
when determining the sufficient value of $n$ for which this probability is at most $\delta$. We
now proceed with several definitions and theorems relating the VC-dimension to the
covering number and apply them to get sample sizes that are sufficient for uniform
SLLN convergence.

3.2  Uniform  Strong  Laws 

The  following  theorems  and  definitions  can  be  found  in  Pollard  [21],  Dudley  [23],  and 
Haussler  [12].  Let  X  denote  some  universal  set. 

Definition 3.2 Given a class $C$ of sets in $X$, and a set $S \subset X$, denote by $\Pi_C(S)$
the set of all subsets of $S$ that can be obtained by intersecting $S$ with a set in $C$, that
is, $\Pi_C(S) = \{S \cap c : c \in C\}$. The VC-dimension of $C$, denoted by $VC(C)$, is defined
as the cardinality of the largest set $S \subset X$ such that $|\Pi_C(S)| = 2^{|S|}$. (Define $VC(C)
= \infty$ if the property holds for $S$ unboundedly large.)

In  words,  the  VC-dimension  of  C  is  the  largest  cardinality  of  a  set  S  of  points,  all  of 
whose  subsets  can  be  obtained  by  intersecting  S  with  sets  in  C. 

Example: Let $C$ be the class of all finite intervals on the real line. When $|S| = 1$
then $|\Pi_C(S)| = 2$. When $|S| = 2$ it is 4. When $|S| > 2$, $|\Pi_C(S)| < 2^{|S|}$. Hence,
$VC(C) = 2$.

Example: Let $C$ be the class of all two-fold unions of intervals on the real line.
W.l.o.g. take a set $S$ of points $x_1 < x_2 < x_3 < x_4$, i.e. $|S| = 4$. From the previous
example, it is clear that when $|S| = 4$ then $|\Pi_C(S)| = 16$ because we can find intervals
that achieve (by intersection with $S$) all 4 possible subsets of $\{x_1, x_2\}$ and intervals
that achieve all 4 possible subsets of $\{x_3, x_4\}$. Taking these intervals in pairs gives us
the 16 possible subsets of $\{x_1, x_2, x_3, x_4\}$. So $VC(C) \ge 4$. Now try any set $S$ with 5
points $\{x_1, x_2, x_3, x_4, x_5\}$ with $x_1 < x_2 < x_3 < x_4 < x_5$. There does not exist any pair
of intervals that can achieve the dichotomy $\{\{x_1, x_3, x_5\}, \{x_2, x_4\}\}$, i.e., that picks out
the subset $\{x_1, x_3, x_5\}$. Thus $VC(C) = 4$.
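
The two interval examples can be checked by brute force. The following Python sketch (the function names are illustrative and not part of the original text) enumerates every subset of $m$ ordered points that a union of at most $k$ intervals can pick out, and reports the largest $m$ that is shattered. It relies on the assumption that only the ordering of the points on the line matters, so the points are taken to be $0, 1, \ldots, m-1$.

```python
def interval_traces(points):
    """Subsets of the sorted list `points` cut out by a single interval."""
    traces = {frozenset()}                      # an interval missing every point
    for i in range(len(points)):
        for j in range(i, len(points)):
            traces.add(frozenset(points[i:j + 1]))   # a contiguous run of points
    return traces


def union_traces(points, k):
    """Subsets of `points` cut out by a union of at most k intervals."""
    single = interval_traces(points)
    traces = {frozenset()}
    for _ in range(k):
        traces = {a | b for a in traces for b in single}
    return traces


def is_shattered(m, k):
    """True if m ordered points are shattered by k-fold unions of intervals."""
    points = list(range(m))
    return len(union_traces(points, k)) == 2 ** m


def vc_dimension(k, max_m=8):
    """Largest m <= max_m such that m points on the line are shattered."""
    d = 0
    for m in range(1, max_m + 1):
        if is_shattered(m, k):
            d = m
        else:
            break
    return d


print(vc_dimension(1))  # single intervals: prints 2
print(vc_dimension(2))  # two-fold unions of intervals: prints 4
```

For five points the alternating subset $\{x_1, x_3, x_5\}$ consists of three separate runs, which is why `is_shattered(5, 2)` is false.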

Example: A neuron (or a linear threshold element) with $N$ inputs may be represented
by an $N$-dimensional hyperplane. The number of dichotomies of a set of $m$
points (in general position) that such a hyperplane can separate is given by the quantity

$$2 \sum_{j=0}^{N} \binom{m-1}{j}$$

which equals $2^m$ if and only if $m \le N + 1$. This is the celebrated result of L. Schläfli
[46]. Hence the VC dimension of the class of all neurons with $N$ inputs is equal to $N + 1$.
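
As a quick numerical check of the counting formula, the sketch below evaluates $2\sum_{j=0}^{N}\binom{m-1}{j}$ and confirms that it equals $2^m$ exactly when $m \le N + 1$. The function name `dichotomy_count` and the choice $N = 3$ are illustrative only.

```python
from math import comb


def dichotomy_count(m, n_inputs):
    """Number of dichotomies of m points in general position realizable by
    a linear threshold element with n_inputs inputs (Schlafli's formula)."""
    return 2 * sum(comb(m - 1, j) for j in range(n_inputs + 1))


N = 3
for m in range(1, 8):
    count = dichotomy_count(m, N)
    # The last column is True only while m <= N + 1.
    print(m, count, count == 2 ** m)
```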

Theorem 3.3 Given any set $S$ of cardinality $m > 0$ and a class $C$ with $VC(C) =
d > 0$, then $|\Pi_C(S)| \le \sum_{j=0}^{d} \binom{m}{j}$ if $m > d$ and $|\Pi_C(S)| \le 2^m$ otherwise.

For our purpose, we will use the fact that for $m \ge 2$ and $d \ge 2$, the sum
$\sum_{j=0}^{d} \binom{m}{j} \le m^d$ so that $|\Pi_C(S)| \le m^d$ in consequence.
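
The polynomial bound used above is easy to confirm numerically. The sketch below (with the illustrative helper `growth_bound`) compares $\sum_{j=0}^{d}\binom{m}{j}$ with $m^d$ over a small grid, and also shows that the restriction $d \ge 2$ matters: for $d = 1$ the sum is $m + 1 > m$.

```python
from math import comb


def growth_bound(m, d):
    """Sauer-type bound: sum_{j=0}^{d} C(m, j)."""
    return sum(comb(m, j) for j in range(d + 1))


# The bound m**d holds on the whole grid m >= 2, d >= 2 ...
holds = all(growth_bound(m, d) <= m ** d
            for m in range(2, 60) for d in range(2, 12))
print(holds)

# ... but fails for d = 1, where the sum equals m + 1.
print(growth_bound(5, 1), 5 ** 1)
```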

Definition 3.4 The graph of a real-valued function $f(x)$ on a set $X$ is defined as the
subset $G_f = \{(x, y) : 0 \le y \le f(x) \text{ or } f(x) \le y \le 0\}$ of $X \times \mathbb{R}$.

A graph of a function is displayed in Figure 3.1.

Definition 3.5 The VC-dimension of a class $\mathcal{H}$ of real-valued functions $h(x)$ on $X$
is the VC-dimension of the class of sets that are graphs of the functions in $\mathcal{H}$.

Theorem 3.6 Let the class $\mathcal{H}$ be a $d$-dimensional vector space of real-valued functions
from $X$ to $\mathbb{R}$, i.e., the functions $h(x)$ are linear combinations of some basis set
$\{\phi_1(x), \phi_2(x), \ldots, \phi_d(x)\}$. Then the class of sets of the form $\{x \in X : h(x) > 0\}$ has
VC-dimension $= d$.

Figure 3.1:

Example: Let $\mathcal{H}$ be all functions $h(x)$ of the form $h(x) = a_1\phi_1 + a_2\phi_2 + a_3\phi_3$
where $\{\phi_1, \phi_2, \phi_3\} = \{1, x, x^2\}$. Then $VC(\mathcal{H}) = 3$.

Example: Let $\mathcal{H}$ be a class of sets of the form $\{x : |x - \theta| < 1,\ x, \theta \in \mathbb{R}^N\}$. Then we
can express such sets by $\{x : g_\theta(x) > 0\}$ where $g_\theta(x) = -\sum_{i=1}^{N} x_i^2 - |\theta|^2 + 2\sum_{i=1}^{N} x_i\theta_i + 1$.
Clearly $g_\theta$ is a linear combination of the basis $\{1, x_1, x_1^2, \ldots, x_N, x_N^2\}$. Since the functions
$g_\theta$ lie in this $(2N+1)$-dimensional vector space, $VC(\mathcal{H}) \le 2N + 1$.

Definition 3.7 (Covering number) Let $Q$ be a probability measure on $X$ and let
$\mathcal{H}$ be a class of functions in $L^1(Q)$, i.e., $\mathbf{E}_Q(|h|) < \infty$. For each $\epsilon > 0$, the
covering number $\mathcal{N}(\epsilon, \mathcal{H}, L_Q)$ is defined as the smallest value of $k$ for which there
exist functions $g_1, g_2, \ldots, g_k$ (not necessarily in $\mathcal{H}$) such that $\min_j \mathbf{E}_Q|h - g_j| \le \epsilon$
for each $h \in \mathcal{H}$. If no such $k$ exists, then $\mathcal{N}(\epsilon, \mathcal{H}, L_Q) = \infty$.

As mentioned earlier, uniform convergence is achieved by first approximating a class
of functions by a finite covering. This introduces the covering number into the bounds
on the deviation-probability. With the next theorem (from Pollard [21] and Haussler
[12]) it is possible to replace the covering number in these bounds by a quantity that
involves the VC-dimension (which may sometimes be easier to calculate). This was
done to obtain Theorem 3.9 and Theorem 3.10. The definition of permissibility can
be found in Pollard [21], and is a regularity condition guarding against some possible
measurability difficulties; basically, if a class of functions can be shown to be indexed
by some parameter that lives in a compact metric space, then it is possible to exhibit a
finite covering for this class. In our applications, i.e., maximum-likelihood estimation
and kernel-density estimation, we explicitly show the existence of a finite covering for
the particular function class that is used.


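Theorem 3.8 Let $\mathcal{H}$ be a permissible class of functions from $X$ to the interval $[0,M]$
with $VC(\mathcal{H}) = d$ for some $1 \le d < \infty$. Let $Q$ be any probability measure on $X$.
Then for all $0 < \epsilon \le M$,

$$\mathcal{N}(\epsilon, \mathcal{H}, L_Q) < 2\left( \frac{2eM}{\epsilon} \log \frac{2eM}{\epsilon} \right)^d.$$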


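The following is Corollary 2 in Haussler [12].

Theorem 3.9 (the 1st uniform convergence) Let $\mathcal{H}$ be a permissible class of functions
from $X$ into a bounded interval $[0,M]$ with $VC(\mathcal{H}) = d$ for some $1 \le d < \infty$.
Assume $n \ge 1$, and draw a random $n$-sample independently according to any distribution
$Q$ on $X$. Then, for all $0 < \epsilon \le M$,

$$P\left( \sup_{h \in \mathcal{H}} \left| \frac{1}{n}\sum_{i=1}^{n} h(x_i) - \mathbf{E}h(x) \right| > \epsilon \right) < 8 \left( \frac{32eM}{\epsilon} \log \frac{32eM}{\epsilon} \right)^d e^{-\epsilon^2 n / 64M^2},$$

where $P$ denotes probability measure corresponding to independent sampling according
to $Q$. Moreover, for

$$n \ge \frac{64M^2}{\epsilon^2} \left( 2d \log \frac{16eM}{\epsilon} + \log \frac{8}{\delta} \right)$$

this probability is at most $\delta$.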
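
To get a feel for the magnitudes involved, the following sketch evaluates a sufficient sample size of the form $n \ge \frac{64M^2}{\epsilon^2}\left(2d\log\frac{16eM}{\epsilon} + \log\frac{8}{\delta}\right)$, which is how we read the sample-size condition of Haussler's Corollary 2, and checks that a tail bound of the form $8\left(\frac{32eM}{\epsilon}\log\frac{32eM}{\epsilon}\right)^d e^{-\epsilon^2 n/64M^2}$ is then at most $\delta$. Both expressions are assumptions taken from that reading of the corollary, the function names are illustrative, and natural logarithms are assumed.

```python
from math import ceil, e, exp, log


def sufficient_n(eps, delta, M, d):
    """Assumed sample size: (64 M^2 / eps^2) (2 d ln(16 e M / eps) + ln(8 / delta))."""
    return ceil(64 * M ** 2 / eps ** 2
                * (2 * d * log(16 * e * M / eps) + log(8 / delta)))


def tail_bound(n, eps, M, d):
    """Assumed bound: 8 (32eM/eps * ln(32eM/eps))^d * exp(-eps^2 n / (64 M^2))."""
    a = 32 * e * M / eps
    # Work in the log domain to avoid overflow for large d.
    return exp(log(8) + d * log(a * log(a)) - eps ** 2 * n / (64 * M ** 2))


# Illustrative values only.
eps, delta, M, d = 0.1, 0.05, 1.0, 5
n = sufficient_n(eps, delta, M, d)
print(n, tail_bound(n, eps, M, d) <= delta)
```

The dependence on $\epsilon$ is essentially $1/\epsilon^2$ up to a logarithmic factor, while $d$ enters linearly.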
The following is based on Theorem 37 in Pollard [21]. The idea is that even
if a class $\mathcal{H}$ depends on the sample size $n$ (as will be the case when we deal with
kernel density estimation) then it is still possible to have uniform convergence for
the empirical means of functions $h \in \mathcal{H}$ to their true means. In Pollard [21] the
condition is for functions in the class to have magnitude bounded by 1. We state
the result here under the condition $|h| \le M$ for every $h \in \mathcal{H}$. Note that the result is
distribution-free.


Theorem 3.10 (the 2nd uniform convergence) For each $n$, let $\mathcal{H}_n$ be a permissible
class of functions whose covering number is bounded as in Theorem 3.8, where the
constants $M$ and $d$ do not depend on $n$. Suppose $x_1, \ldots, x_n$ are obtained by independent
sampling from an arbitrary probability distribution on $X$. If $|h| \le M$ and
$\mathbf{E}(h^2) \le \delta_n^2$ for each $h \in \mathcal{H}_n$, where $\delta_n^2$ satisfies $\log n / (n\delta_n^2) \to 0$, then

$$P\left( \sup_{h \in \mathcal{H}_n} \left| \frac{1}{n}\sum_{i=1}^{n} h(x_i) - \mathbf{E}h(x) \right| > \epsilon \frac{\delta_n^2}{M} \right) < 24 \left( \frac{32eM^2}{\epsilon\delta_n^2} \log \frac{32eM^2}{\epsilon\delta_n^2} \right)^d e^{-n\epsilon^2\delta_n^2 / 8192M^2}$$

and the RHS $\to 0$ faster than any power of $n$.


The fact that the bound goes to 0 faster than any power of $n$ ensures a.e. convergence
(via the Borel-Cantelli lemma), therefore achieving uniform convergence of
empirical means to expectations over the whole class $\mathcal{H}_n$ for arbitrary sampling
distributions.

Chapter 4

Parametric  Scenario 


In this chapter we discuss several variants of learning a classification rule with parametric
underlying distributions. The overall theme is to determine the sample complexities
of learning to classify when the form of the densities involved is known to
the learner, and either a mixed or just a labeled sample is available. Following the
approach of the Computational Learning Theory field (cf. [34], [12]), we measure
the sample complexity of learning by the number of examples that are sufficient to
achieve an accuracy $\epsilon > 0$ in learning the decision rule, with a confidence of at least
$1 - \delta$. Hence all the statements of learnability that we make
are probabilistic in nature, where the confidence parameter $\delta > 0$ can be arbitrarily
chosen.

The investigated scenarios are limited to multi-dimensional Gaussian distributed
pattern classes with unit covariances and the theorems are stated for this family of
problems. It will be quite clear, however, that the analysis techniques pertain to
other parametric families as well, albeit resulting in different constants and rates.

We  start  in  Section  4.1  by  determining  the  sample  complexity  of  learning  only 
with  a  labeled  sample,  where  the  classification  problem  has  two  equiprobable  pattern 
classes.  The  learner  uses  algorithm  E,  based  on  moment  estimation,  to  construct 
a decision rule. Tight bounds on the deviation of the moment estimates from the
true values yield a tight sample complexity bound. This learning scenario represents
a  state  in  which  the  learner  utilizes  the  labeled  sample  efficiently  and  has  access 
to  the  sufficient  statistics  for  estimating  the  class  conditional  densities.  The  cost 
of  learning  under  such  a  scenario  is  therefore  representative  of  the  minimal  cost  in 
terms  of  exploitation  of  information  under  this  scenario.  This  hence  provides  a  good 
reference  point  for  interpreting  the  cost  of  learning  with  a  mixed  sample.  In  particular, 
we  determine  the  reduction  in  the  labeled  sample  size  due  to  introducing  unlabeled 
examples.  This  establishes  the  tradeoff  between  labeled  and  unlabeled  examples  when 
the  parametric  form  of  the  densities  is  available  as  side  information. 

In Section 4.2 the case of a mixed sample is considered. The problem is the same,
i.e., two equiprobable pattern classes each distributed as a unit-covariance
multi-dimensional Gaussian. The learner is given randomly drawn unlabeled and labeled
examples and uses algorithm M, which is based on maximum likelihood estimation, to
construct estimates of the two class conditional means using only unlabeled examples.
These are then used to construct a linear decision border which approximates the
Bayes partition. The labeled examples are used in this approach only for labeling the
two regions separated by the hyperplane. As expected, the mixed sample approach uses
fewer labeled examples. The reduction, compared to the purely labeled sample approach,
is significant, being polynomial in the dimensionality of the feature space and in the
accuracy and confidence parameters.

We will proceed as follows: in Section 4.1 we state Theorem 4.1, which pertains
to learning with a purely labeled sample, then preview the proof before providing the
actual proof. In Section 4.2 we state Theorem 4.2, which pertains to mixed-sample
learning, followed by a preview of the proof. The proof is given in Section 4.3. The
referenced auxiliary lemmas are included in the proof. In Sections 4.4 and 4.5 we analyze
the same classification problem as in the previous sections, except that the two classes
have different a priori probabilities. Discussions of the results are deferred to Chapter 6.

Figure 4.1:

4.1  Purely  Labeled  Sample 


Here the parametric form of the $N$-dimensional class conditional densities $f_1(x)$, $f_2(x)$
is known:

$$f_i(x) = \frac{1}{(2\pi)^{N/2}} e^{-\frac{1}{2}|x - \mu_{0i}|^2}, \qquad i = 1, 2.$$

We denote this by writing $f_1(x) = g(x|\mu_{01})$ and $f_2(x) = g(x|\mu_{02})$. The only unknowns
are the two mean vectors, $\mu_{01}$ and $\mu_{02}$. The Bayesian decision border is a linear
hyperplane orthogonal to $\mu_{02} - \mu_{01}$ (see Figure 4.1). A learner is given only labeled
examples drawn according to the mixture $f(x) = \frac{1}{2}g(x|\mu_{01}) + \frac{1}{2}g(x|\mu_{02})$. Algorithm E,
based on moment estimation, is used to determine close estimates of the means, with
which a classifier is constructed.

We  first  state  the  algorithm  then  we  state  the  theorem,  provide  a  preview  of  the 
proof  followed  by  the  proof  itself. 


Algorithm E:

The setting: Two pattern classes with underlying Gaussian mixture density

$$f(x) = \frac{1}{2}g(x|\mu_{01}) + \frac{1}{2}g(x|\mu_{02}).$$

The teacher draws labeled examples according to $f(x)$ by choosing class "1"
or class "2" with probability $\frac{1}{2}$ and then drawing according to the selected
class conditional density $g(x|\mu_{0i})$, $i = 1, 2$.

Given: $m_1$ examples labeled as "1" and $m_2$ examples labeled as "2", where
$m_1 + m_2 = m$.

Begin:

1) Let the mean estimates of $\mu_{0i}$, $i = 1, 2$, be

$$\hat{\mu}_{ik} = \frac{1}{m_i} \sum_{j=1}^{m_i} x_j^k \qquad (i = 1, 2;\ k = 1, \ldots, N),$$

where the sum runs over the $m_i$ examples labeled "$i$" and $x_j^k$ denotes the $k$th
element of the $j$th such example $x_j$.

2) Let the decision border be the hyperplane that passes through the
point $\frac{\hat{\mu}_1 + \hat{\mu}_2}{2}$ and is orthogonal to the vector $\hat{\mu}_1 - \hat{\mu}_2$.

3) Label the two decision regions across the hyperplane by the subscript
of the mean estimate $\hat{\mu}_i$, $i = 1, 2$, on that side, respectively.

End.
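
The three steps of Algorithm E can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the representation of examples as `(x, label)` pairs, the simulated teacher, and the particular means, sample size and seed are all assumptions made for the example.

```python
import random


def algorithm_e(examples):
    """Sketch of Algorithm E.

    `examples` is a list of (x, label) pairs, where x is a list of N floats
    and label is 1 or 2. Returns the two mean estimates and a classifier.
    """
    dim = len(examples[0][0])
    sums = {1: [0.0] * dim, 2: [0.0] * dim}
    counts = {1: 0, 2: 0}
    for x, label in examples:
        counts[label] += 1
        for k in range(dim):
            sums[label][k] += x[k]
    if counts[1] == 0 or counts[2] == 0:
        raise ValueError("need at least one example from each class")

    # Step 1: componentwise sample means of each class.
    mu = {c: [s / counts[c] for s in sums[c]] for c in (1, 2)}
    # Step 2: hyperplane through the midpoint, normal to mu[2] - mu[1].
    normal = [b - a for a, b in zip(mu[1], mu[2])]
    midpoint = [(a + b) / 2 for a, b in zip(mu[1], mu[2])]

    # Step 3: each side carries the label of the mean estimate lying on it.
    def classify(x):
        h = sum(w * (xk - mk) for w, xk, mk in zip(normal, x, midpoint))
        return 2 if h > 0 else 1

    return mu, classify


def draw_example(rng, mu01, mu02):
    """Teacher: pick a class with probability 1/2, then draw from N(mu, I)."""
    label = 1 if rng.random() < 0.5 else 2
    centre = mu01 if label == 1 else mu02
    return [rng.gauss(c, 1.0) for c in centre], label


rng = random.Random(0)
mu01, mu02 = [-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]
sample = [draw_example(rng, mu01, mu02) for _ in range(2000)]
mu_hat, classify = algorithm_e(sample)
print(mu_hat[1], mu_hat[2])
print(classify(mu01), classify(mu02))
```

The classifier only needs the normal vector and the midpoint, so its cost per query is linear in the dimension $N$.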


Theorem 4.1 Suppose we are given two equiprobable classes which are distributed
according to Gaussian probability densities $g(x|\mu_{01})$ and $g(x|\mu_{02})$, with means
$\mu_{0i} \in \mathbb{R}^N$, $i = 1, 2$, and unit covariances. For small $\epsilon > 0$, arbitrary $\delta > 0$, given

$$m = \frac{4N}{\epsilon} \log \frac{8N}{\delta}$$

labeled examples and $n = 0$ unlabeled examples, algorithm E results in a decision rule
with a classification error

$$P_{Bayes} \le P_{error} \le P_{Bayes}(1 + c\epsilon)$$

with confidence at least $1 - \delta$, where $c > 0$ is a constant depending only on the distance
between the means.

We first preview the proof. The aim is to estimate the Bayes decision border
(which is a hyperplane orthogonal to $\mu_{01} - \mu_{02}$) by a hyperplane that is orthogonal
to the difference of the two sample averages, $\hat{\mu}_1 - \hat{\mu}_2$, and which passes through their
midpoint.

The teacher draws labeled examples from the mixture $f(x)$. We first establish
the sufficient sample size, $m_1$, of examples from class "1" for which $\hat{\mu}_1$ will be $\epsilon$-close
to $\mu_{01}$. The estimate for $m_2$ will be identical. Then we find a sufficient value of $m$ such
that the number of "1"-labeled examples is at least $m_1$ and the number of "2"-labeled
examples is at least $m_2$ with high confidence.

To obtain an exponentially small bound (w.r.t. sample size) on the deviation
between each mean and its corresponding sample average, we utilize the Chernoff
bound (cf. Papoulis [43]), which is a variant of Chebyshev's inequality. This bound
uses the moment generating function and hence can be easily specialized to a normal
random variable. It gives a high confidence for both sample averages to be $\epsilon$-close
to their respective means with $m$ as above. Then we analyze the classification error
of the resulting decision rule by finding the worst-case (error-wise) deviation of the
hyperplane from the Bayes optimal hyperplane. As a consequence of having a linear
decision boundary (hyperplane) the discriminant function which represents the decision
region becomes a univariate normally distributed random variable. This leads
directly to a bound on $P_{error}$ which depends on the above $\epsilon$ and hence is valid with the
above confidence, given that $m$ is as stated in the theorem. Note that the constants
in the theorem are not the best possible and can doubtless be improved.

Proof:

The aim is to show that $P(\{|\hat{\mu}_1 - \mu_1| > \epsilon\} \cup \{|\hat{\mu}_2 - \mu_2| > \epsilon\})$ is at most $\delta$
when $m = m_1 + m_2$ is as specified in the theorem.

First we determine the sufficient sample size $m_1$ from class "1". We use $\hat{\mu}_{1k}$ to
denote the $k$th element of the vector $\hat{\mu}_1$, and denote by $x_i^k$ the $k$th element of the $i$th
example $x_i$. To have $|\hat{\mu}_1 - \mu_1|^2 > \epsilon^2$ it is necessary that at least one of the $N$
components satisfies $(\hat{\mu}_{1k} - \mu_{1k})^2 > \epsilon^2/N$, for $k = 1, \ldots, N$. So, as a consequence of Boole's
inequality,

$$P(|\hat{\mu}_1 - \mu_1| > \epsilon) \le \sum_{k=1}^{N} P\left( |\hat{\mu}_{1k} - \mu_{1k}| > \epsilon/\sqrt{N} \right).$$

Now since $x$ is a vector distributed as $N(\mu_1, I)$, each component $x^k$ is distributed
as $N(\mu_{1k}, 1)$; hence the sample mean estimate $\hat{\mu}_{1k} = \frac{1}{m_1}\sum_{i=1}^{m_1} x_i^k$ is normal with mean
$\mu_{1k}$ and variance $1/m_1$. Now note the useful elementary bound

$$P(z > A) \le \inf_{s > 0} e^{-sA}\, \mathbf{E}(e^{sz}).$$

Let

$$z = \frac{1}{m_1}\sum_{i=1}^{m_1} x_i^k - \mu_{1k}$$

and $A = \epsilon/\sqrt{N}$. Simple algebra upper bounds $P(z > A)$ by $e^{-\epsilon^2 m_1/2N}$. And hence

$$P\left( \left| \frac{1}{m_1}\sum_{i=1}^{m_1} x_i^k - \mu_{1k} \right| > \frac{\epsilon}{\sqrt{N}} \right) \le 2e^{-\epsilon^2 m_1/2N}.$$

So from above,

$$P(|\hat{\mu}_1 - \mu_1| > \epsilon) \le 2N e^{-\epsilon^2 m_1/2N} = \delta/2.$$

It follows that the sufficient sample size is

$$m_1 \ge \frac{2N}{\epsilon^2} \log \frac{4N}{\delta}.$$
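
The sample size just derived is simple to evaluate. The sketch below (function names and the numerical values are illustrative, and the natural logarithm is assumed) computes $m_1 = \lceil (2N/\epsilon^2)\log(4N/\delta) \rceil$ and confirms that the union bound $2Ne^{-\epsilon^2 m_1/2N}$ is then at most $\delta/2$.

```python
from math import ceil, exp, log


def class_sample_size(n_dim, eps, delta):
    """Smallest integer m1 with m1 >= (2N / eps^2) * ln(4N / delta)."""
    return ceil(2 * n_dim / eps ** 2 * log(4 * n_dim / delta))


def deviation_bound(m1, n_dim, eps):
    """Union bound 2N exp(-eps^2 m1 / (2N)) on P(|mu_hat - mu| > eps)."""
    return 2 * n_dim * exp(-eps ** 2 * m1 / (2 * n_dim))


N, eps, delta = 10, 0.1, 0.05
m1 = class_sample_size(N, eps, delta)
print(m1, deviation_bound(m1, N, eps) <= delta / 2)
```

Note the linear growth in the dimension $N$ (up to the logarithmic factor) and the $1/\epsilon^2$ dependence on the accuracy.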

Repeating the same argument for class "2" we have that

$$m_2 \ge \frac{2N}{\epsilon^2} \log \frac{4N}{\delta}$$

is sufficient for

$$P(|\hat{\mu}_2 - \mu_2| > \epsilon) \le 2N e^{-\epsilon^2 m_2/2N} = \delta/2.$$

Now we find the sufficient $m$ so that the number of class "1" labeled examples
is at least $m_1$ and similarly with class "2". Let $\hat{p}$ be the frequency of "1"-labeled
examples, i.e., $\hat{p} = \frac{1}{m}\sum_{i=1}^{m} 1_{\{y_i = 1\}}$ where $y_i$ is the label of the $i$th example. We have

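$$\frac{m_1}{\hat{p}} \le \frac{m_1}{p - \epsilon_1} = \frac{m_1}{\frac{1}{2} - \epsilon_1}$$

where $\epsilon_1$ is a given constant representing the allowed deviation between the mean and
the average of a binomial random variable. For this we have

$$P(|\hat{p} - p| > \epsilon_1) \le 2e^{-2m\epsilon_1^2} \equiv \delta_1$$

hence it suffices to draw

$$m = \frac{1}{2\epsilon_1^2} \log \frac{2}{\delta_1}$$

labeled examples to have

$$|\hat{p} - p| \le \epsilon_1$$

with confidence $\ge 1 - \delta_1$. Therefore the overall $m$ sufficient for obtaining $m_1$
"1"-examples with confidence $\ge 1 - \delta_1$ is

$$m \ge \max\left\{ \frac{m_1}{\frac{1}{2} - \epsilon_1},\ \frac{1}{2\epsilon_1^2} \log \frac{2}{\delta_1} \right\}.$$

Repeating the above argument for "2"-examples (where $m_2 = m_1 = \frac{2N}{\epsilon^2} \log \frac{4N}{\delta}$) and
combining we get that the sufficient $m$ for obtaining $m_1$ "1"-examples and $m_2$
"2"-examples with confidence $\ge 1 - 2\delta_1$ is

$$m \ge \max\left\{ \frac{m_1}{\frac{1}{2} - \epsilon_1},\ \frac{1}{2\epsilon_1^2} \log \frac{2}{\delta_1} \right\}.$$

To simplify the bound we select $\epsilon_1$ so that the two terms inside the max are equal.
Then substituting for $\epsilon_1$, and replacing both $\delta$ and $2\delta_1$ by $\delta/2$, we obtain that with

$$m \ge \frac{4N}{\epsilon^2} \log \frac{8N}{\delta}$$

we have

$$P(\{|\hat{\mu}_1 - \mu_1| > \epsilon\} \cup \{|\hat{\mu}_2 - \mu_2| > \epsilon\}) \le \delta.$$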

Now we aim to show that if there is an $\epsilon$-deviation for each of the estimates
from the true means then it results in the claimed error. We have two $N$-dimensional
Gaussians, $f_1(x)$ with mean $\mu_1$ and $f_2(x)$ with mean $\mu_2$, and a hyperplane orthogonal
to $\hat{\mu}_2 - \hat{\mu}_1$, passing through the midpoint of the estimates. First translate the Gaussians
so that the line between $\mu_1$ and $\mu_2$ is on the first-dimension axis and the origin is
equidistant from both. Let $A = |\mu_1 - \mu_2|/2$ and let $u^+ = [A, 0, \ldots, 0]^T$ and
$u^- = [-A, 0, \ldots, 0]^T$. Now consider the decision rule that the hyperplane gives; denote it
as $h(x)$. Clearly $h(x) = (\hat{\mu}_2 - \hat{\mu}_1)^T (x - \hat{\mu}_1) - \frac{1}{2}|\hat{\mu}_2 - \hat{\mu}_1|^2$ and the region where
$h(x) > 0$ is classified as class "2", i.e., the decision point is at $h = 0$. The vector $x$ is
jointly Gaussian and so $h(x)$, which is just a linear transformation of $x$, is a
one-dimensional Gaussian random variable conditioned on the high probability event that
the estimates $\hat{\mu}_i$ are $\epsilon$-close to $\mu_i$. Letting $g(x) = h(x)/|\hat{\mu}_2 - \hat{\mu}_1|$ yields

$$p_1(g) \sim N\left( \frac{(\hat{\mu}_2 - \hat{\mu}_1)^T u^- + \frac{1}{2}(|\hat{\mu}_1|^2 - |\hat{\mu}_2|^2)}{|\hat{\mu}_2 - \hat{\mu}_1|},\ 1 \right)$$

and

$$p_2(g) \sim N\left( \frac{(\hat{\mu}_2 - \hat{\mu}_1)^T u^+ + \frac{1}{2}(|\hat{\mu}_1|^2 - |\hat{\mu}_2|^2)}{|\hat{\mu}_2 - \hat{\mu}_1|},\ 1 \right).$$

The decision point is at $g = 0$ and it is away from the Bayes border of these two
one-dimensional distributions by $\frac{\left| |\hat{\mu}_1|^2 - |\hat{\mu}_2|^2 \right|}{2|\hat{\mu}_2 - \hat{\mu}_1|}$. The
configuration of $\hat{\mu}_1$ and $\hat{\mu}_2$ that gives a good upper bound on the probability of error is
achieved by minimizing the distance between the means and maximizing the distance
from the border to 0. After some algebra we get an upper bound


P  <  -<t> 

x  error  _  ^  ^ 


- £ — ]  +  1$  (  e  A 

A  —  e  2  lA-e 


Approximating this expression for small $\epsilon > 0$ and using the fact that $\Phi(-A) =
P_{Bayes}$ we have $P_{error} \le P_{Bayes} + c_1\epsilon^2 = P_{Bayes}(1 + c_2\epsilon^2)$, for some positive
constants $c_1, c_2$. Replace $\epsilon^2$ by $\epsilon$ both in this bound and in the bound for $m$, to get the
claimed statement of the theorem. ∎
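
The effect of the estimation error on $P_{error}$ can also be evaluated exactly for any given pair of estimates, since for unit-covariance Gaussians the error of a hyperplane rule reduces to two one-dimensional Gaussian tail probabilities. The sketch below is an illustration under that standard fact, not the bound derived in the proof; the helper names and the numerical values are hypothetical, and class "2" is assumed to be assigned to the side of the hyperplane that contains the second estimate.

```python
from math import erf, sqrt


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def hyperplane_error(mu1, mu2, mu1_hat, mu2_hat):
    """Exact error of the rule 'class 2 iff w.(x - midpoint) > 0' when the
    classes are equiprobable N(mu1, I) and N(mu2, I)."""
    w = [b - a for a, b in zip(mu1_hat, mu2_hat)]
    mid = [(a + b) / 2 for a, b in zip(mu1_hat, mu2_hat)]
    norm = sqrt(sum(wk * wk for wk in w))
    # Signed distances of the true means from the estimated hyperplane.
    s1 = sum(wk * (m - c) for wk, m, c in zip(w, mu1, mid)) / norm
    s2 = sum(wk * (m - c) for wk, m, c in zip(w, mu2, mid)) / norm
    # Class 1 errs when h > 0, class 2 errs when h <= 0.
    return 0.5 * phi(s1) + 0.5 * phi(-s2)


mu1, mu2 = [-1.0, 0.0], [1.0, 0.0]
bayes = hyperplane_error(mu1, mu2, mu1, mu2)        # equals Phi(-A) with A = 1
perturbed = hyperplane_error(mu1, mu2, [-0.9, 0.1], [1.1, -0.1])
print(bayes, perturbed)
```

With exact estimates the function returns $\Phi(-A)$, and any perturbation of the estimates can only increase the value, in line with $P_{Bayes} \le P_{error}$.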
We now proceed to mixed-sample learning, with side information of the parametric
form of the mixture density of the classes.


4.2  Mixed  Sample 

In this section we consider the problem of classifying two pattern classes with equal a
priori probabilities, each distributed according to a multi-dimensional unit-covariance
Gaussian density. Both unlabeled and labeled examples are drawn according to the
mixture

$$f(x|\theta_0) = \frac{1}{2}f_1(x|\theta_{01}) + \frac{1}{2}f_2(x|\theta_{02})$$

where $f_1$, $f_2$ are Gaussians with means $\theta_{0i}$, $i = 1, 2$, respectively and unit covariance
matrix. (Here $\theta_0 = [\theta_{01}, \theta_{02}]$, and we use two functions $f_1$, $f_2$ since it will enable
us to drop the parameters $\theta_{0i}$ for brevity.) The mixture is indexed by the unknown
vector $\theta_0 = [\theta_{01}, \theta_{02}]$ in a class of multi-dimensional Gaussian mixtures. This class is
identifiable and hence if, using unlabeled examples, we estimate the unknown mixture
$f(x|\theta_0)$ by some other function $f(x|\hat{\theta})$ in this same class, then it will uniquely identify
two class conditional densities, $f_1(x|\hat{\theta}_1)$, $f_2(x|\hat{\theta}_2)$, $\hat{\theta} = [\hat{\theta}_1, \hat{\theta}_2]$, whose Bayes decision
regions approximate the optimal unknown decision regions. The optimal decision border
is a hyperplane orthogonal to $\theta_{01} - \theta_{02}$ and passing through their midpoint. If
$\theta_{01} = \theta_{02}$ then any decision rule with regions $R_1$, $R_2$, in particular any hyperplane going
through the point $\theta_{01}$, yields $P_{error} = P_{Bayes} = \frac{1}{2}$. Thus in that case, the fact that the
hyperplane cannot be identified is insignificant.

We use algorithm M, which is based on Maximum Likelihood Estimation (MLE)
with a mixed sample, to construct a decision rule which has a $P_{error}$ close to $P_{Bayes}$.
The $n$ unlabeled examples are used to find the point $\hat{\theta}$ which maximizes the likelihood
function

$$L(\theta|x_1, \ldots, x_n) = \frac{1}{n}\sum_{i=1}^{n} \log f(x_i|\theta).$$

Figure 4.2:

By taking $n$ sufficiently large, $\hat{\theta}$ is guaranteed with high confidence to be $\epsilon$-close to
the unknown $\theta_0$. (The maximum likelihood estimation principle is discussed below.)
This implies that $\hat{\theta}_i$ is $\epsilon$-close to $\theta_{0i}$, $i = 1, 2$, and a hyperplane is constructed as the
decision border estimate (see Figure 4.2). The algorithm then uses the $m$ labeled
examples with the majority rule, assigning to each of the two regions the label of the
majority of the examples that fell in it. The sample size $m$ is taken sufficiently large so
that with high confidence the labeling having the minimum error is picked. Because
the majority rule results in an exponentially small bound (in $m$) on the probability
of mislabeling the regions, the sample size $m$ is very small, and in particular, is
independent of $N$ and $\epsilon$.
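
The labeling step can be sketched as follows. Given a hyperplane (here described by a hypothetical normal vector `w` and a point `c` on it) and a handful of labeled examples, each side receives the label held by the majority of the examples that fell on it. Tie-breaking and the handling of an empty side are not specified in the text; the choices below are assumptions made only for illustration.

```python
def label_regions(w, c, labeled_examples):
    """Majority-rule labeling of the two sides of {x : w.(x - c) = 0}.

    Returns a dict mapping side (+1 or -1) to a class label (1 or 2).
    """
    votes = {+1: {1: 0, 2: 0}, -1: {1: 0, 2: 0}}
    for x, label in labeled_examples:
        h = sum(wk * (xk - ck) for wk, xk, ck in zip(w, x, c))
        side = +1 if h > 0 else -1
        votes[side][label] += 1

    labels = {}
    for side in (+1, -1):
        if votes[side][1] == votes[side][2]:
            labels[side] = None          # tie or empty side: decided below
        else:
            labels[side] = 1 if votes[side][1] > votes[side][2] else 2
    # Assumption: an undecided side takes the label opposite to the other side.
    for side in (+1, -1):
        if labels[side] is None:
            other = labels[-side]
            labels[side] = 3 - other if other is not None else (2 if side > 0 else 1)
    return labels


w, c = [1.0, 0.0], [0.0, 0.0]
examples = [([1.2, 0.3], 2), ([0.8, -1.0], 2), ([-0.2, 0.5], 2),
            ([-1.1, 0.0], 1), ([-0.7, 0.9], 1), ([0.1, 0.2], 1)]
print(label_regions(w, c, examples))   # prints {1: 2, -1: 1}
```

Even though two of the six examples fall on the "wrong" side, the majority on each side still recovers the correct labeling, which is the reason so few labeled examples are needed.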

The  main  reason  that  we  chose  the  MLE  for  the  unsupervised  estimation  pro¬ 
cedure  is  for  its  direct  coupling  with  the  uniform  SLLN  principle.  As  mentioned  in 
Chapter  2,  this  induces  a  clear  notion  of  cost,  through  finite  sample  complexities, 
which  is  what  we  seek. 

We  now  provide  a  brief  review  of  the  MLE  principle. 


4.2.1  Maximum  Likelihood  Estimation:  A  Review 

The method of Maximum Likelihood Estimation (MLE) was first proposed by the
German mathematician C. F. Gauss in 1821. However, the approach is usually credited
to the English statistician R. A. Fisher, who first investigated the properties of this
method in 1922 (cf. Bickel & Doksum [39]). The intuition behind this method
is based on the following: consider the frequency or density function $f(x|\theta)$ of the
random variable $X$ where $\theta$ is a parameter vector in a subset $\Theta$ of a finite-dimensional
Euclidean space. Given $n$ realizations $x_1, \ldots, x_n$ of $X$, drawn independently and
identically distributed according to $f(x|\theta_0)$, the likelihood function $L(\theta|x_1, \ldots, x_n)$ is
defined as

$$L(\theta|x_1, \ldots, x_n) = \frac{1}{n}\sum_{i=1}^{n} \log f(x_i|\theta).$$

(We will sometimes drop the dependence on the sample and write $L(\theta)$.) If the random
variable $X$ is discrete then for each $\theta$ the likelihood function represents, up to the factor
$1/n$, the log of the probability of observing the sample $x_1, \ldots, x_n$. Thus
$L(\theta|x_1, \ldots, x_n)$ represents a measure of how likely $\theta$ is to have produced the observed
sample. The method of MLE aims at finding the parameter value
$\hat{\theta} = \arg\sup_{\theta \in \Theta} L(\theta|x_1, \ldots, x_n)$ which is the most likely to have produced the given sample.

To illustrate this method consider the following example. Let $x_1, \ldots, x_n$ be observations
from a Gaussian $N(\mu_0, \sigma_0^2)$, s.t. $\theta_0 = [\mu_0, \sigma_0^2] \in \Theta$ where the parameter
space $\Theta$ is $-\infty < \mu < \infty$, $0 < \sigma^2 < \infty$. We seek an estimate $\hat{\theta}$ of $\theta_0$. Simple algebra
gives, up to an additive constant and the normalizing factor $1/n$,

$$L(\theta) = -n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

and the maximum likelihood estimator is

$$\hat{\theta} = [\bar{x}, s^2]$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$. By the law of large numbers this ML
estimate is consistent, i.e., $\bar{x} \to \mu_0$ and $s^2 \to \sigma_0^2$ as $n \to \infty$. In general, as in this
example, the MLE method yields asymptotically optimal parameter estimators, being
asymptotically efficient, consistent and a function of the minimal sufficient statistic.
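
For the Gaussian example the maximizer is available in closed form, which makes it easy to illustrate. The sketch below draws a sample, computes $\bar{x}$ and $s^2$, and checks that perturbing either parameter does not increase the normalized log-likelihood $\frac{1}{n}\sum_i \log f(x_i|\theta)$. The sample size, seed, true parameters and function names are arbitrary choices made for illustration.

```python
import random
from math import log, pi


def log_likelihood(xs, mu, var):
    """Normalized log-likelihood (1/n) sum_i log N(x_i; mu, var)."""
    n = len(xs)
    return sum(-0.5 * log(2 * pi * var) - (x - mu) ** 2 / (2 * var)
               for x in xs) / n


def gaussian_mle(xs):
    """Closed-form ML estimates: sample mean and (1/n) sample variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var


rng = random.Random(1)
xs = [rng.gauss(2.0, 1.5) for _ in range(5000)]   # true mu0 = 2, sigma0^2 = 2.25
mu_hat, var_hat = gaussian_mle(xs)
best = log_likelihood(xs, mu_hat, var_hat)
print(mu_hat, var_hat)
print(all(log_likelihood(xs, mu_hat + dm, var_hat * dv) <= best
          for dm in (-0.1, 0.0, 0.1) for dv in (0.9, 1.0, 1.1)))
```

The estimates approach $\mu_0$ and $\sigma_0^2$ as the sample grows, which is the consistency property referred to above.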

However there are some possible difficulties with this method. The first is that
$L(\theta)$ may be unbounded over $\Theta$. An example of this (cf. Redner & Walker [7]) is
when $f(x|\theta)$ is a mixture of two Gaussians having the same variance $\sigma^2$ and means
$\mu_1$, $\mu_2$. The likelihood function

$$L(\theta|x_1, \ldots, x_n) = \frac{1}{n}\sum_{i=1}^{n} \log f(x_i|\theta)$$

can be unbounded for $\theta = [\mu_1, \mu_2, \sigma^2]$ in the limit as $\sigma \to 0$ and where one of the
means $\mu_1$, $\mu_2$ coincides with an example $x_i$. Hence if $\Theta$ is $-\infty < \mu_i < \infty$, $i = 1, 2$,
and $0 < \sigma^2 < \infty$, and $\theta_0 \in \Theta$, then $\hat{\theta}$ will not yield a consistent estimate since the
$\arg\sup_{\theta \in \Theta} L(\theta)$ does not tend to $\theta_0$ as $n \to \infty$. There are some ways to circumvent
this difficulty, including various regularization techniques (cf. Grenander [47])
in which the parameter space $\Theta$ is allowed to change with $n$ such that the singular
points are contained only in the limit as $n \to \infty$.

We note that the function $L(\theta)$ may have several relative and global maxima. If
the density $f(x|\theta)$ is identifiable, i.e., there do not exist two different parameters $\theta_a$
and $\theta_b$ which correspond to the same density, then under some weak conditions, it
can be shown theoretically that the ML estimate is consistent. But if $f(x|\theta)$ is not
identifiable then regardless of how large $n$ is, $L(\theta)$ can possibly attain its maximum
value at several different points and thus one is left with no clue as to which of these
points should be chosen to be the estimate of the unknown parameter $\theta_0$.

It should be mentioned that even if the density $f(x|\theta)$ is identifiable and $L(\theta)$
is well behaved, finding the maximum of $L(\theta)$ can still be costly in practice and
involves some type of global optimization technique (cf. Redner & Walker [7]).

The MLE method has a robust theoretical basis. There is a vast amount of
literature about this method from both experimental and theoretical aspects. From
the theoretical side, considerable work exists in proving convergence of the estimator
$\hat{\theta}$ to the unknown $\theta_0$ as the sample size $n \to \infty$; see Wald [8], Le Cam [9], Bahadur
[10], Huber [11]. The experimental work regarding MLE is concerned with efficient
algorithms for finding the global maximum of the likelihood function $L(\theta)$.


On the theoretical aspect of the MLE principle, a brief historical overview shows that initially, in 1946, Cramer [35] established the consistency of a $\hat\theta$ at which the likelihood function $L(\theta)$ has a relative maximum. This however is not strong enough, since there may be several relative maxima and one cannot know which of these critical points to select as the estimator of $\theta_0$. In 1949, Wald [8] established the consistency of the global maximum of $L(\theta)$. This means that the critical point at which $L(\theta)$ achieves its highest maximum should be chosen as the estimator of $\theta_0$, resulting in a consistent estimate of $\theta_0$. Wald introduced an ingenious method combining the extremal properties of the Kullback-Leibler distance function with a uniform SLLN. His method is fundamental in the subject of MLE and it permits remarkable extensions to the case of infinite-dimensional abstract parameter spaces (cf. Grenander [47], Chapter 7). The main details of the ML principle will appear in the proof of Theorem 4.2. Much work has been done since then in weakening the conditions of Wald's proof. This includes the work of Huber [11].
4.2.2  Gaussian Mixture


In the previous section we mentioned that the MLE principle yields consistent parameter estimators. Consistency is an asymptotic notion, i.e., a property of the estimator as $n \to \infty$. Our interest is not in asymptotics but in the finite sample size $n$ required for a prespecified estimation error when using the MLE method. Using that, the tradeoff between the unlabeled and labeled sample sizes for learning classification can be calculated.

We now describe algorithm M and then proceed with the technical details.

Algorithm M:

The setting: Two pattern classes with underlying Gaussian mixture density

$$f(x|\theta_0) = \tfrac{1}{2} f_1(x|\theta_{01}) + \tfrac{1}{2} f_2(x|\theta_{02})$$

where $\theta_0 = [\theta_{01}, \theta_{02}]$ is in a compact set $\Theta$ of $\mathbb{R}^{2N}$. The teacher draws labeled and unlabeled examples independently according to $f(x|\theta_0)$ by choosing class "1" or class "2" with probability $\tfrac{1}{2}$ and then drawing according to the selected class conditional density $f_i(x|\theta_{0i})$, $i = 1,2$.

Given: $m$ labeled examples and $n$ unlabeled examples.

Begin:

1) Find a point $\hat\theta \in \mathbb{R}^{2N}$ satisfying

$$\hat\theta = \arg\sup_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^{n} \log f(x_i|\theta).$$

2) Select as separating surface the hyperplane that passes through the point $(\hat\theta_1 + \hat\theta_2)/2$ and is orthogonal to the vector $\hat\theta_2 - \hat\theta_1$.

3) Label each of the two decision regions separated by the hyperplane by the label of the majority of the labeled examples in the region.

End.
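The three steps of algorithm M can be sketched numerically. The following is a minimal illustration only, not the procedure analyzed in the text: the global maximization in step 1 is replaced by EM with a few random restarts (a swapped-in heuristic that may stop at a local maximum), the hyperplane is assumed to pass through the midpoint of the two estimated means, and all function names (`avg_log_likelihood`, `fit_means_em`, `algorithm_m`) are hypothetical.

```python
import numpy as np

def avg_log_likelihood(x, th1, th2):
    """(1/n) * sum_i log f(x_i | theta) for f = 0.5*N(th1, I) + 0.5*N(th2, I)."""
    dim = x.shape[1]
    a = -0.5 * np.sum((x - th1) ** 2, axis=1)
    b = -0.5 * np.sum((x - th2) ** 2, axis=1)
    const = -np.log(2.0) - 0.5 * dim * np.log(2.0 * np.pi)
    return float(np.mean(np.logaddexp(a, b)) + const)

def fit_means_em(x, n_starts=5, n_iter=200, seed=0):
    """Step 1 (heuristic stand-in): EM over the two means only, since the
    mixing weights (1/2) and covariances (identity) are known."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        i, j = rng.choice(len(x), size=2, replace=False)
        th1, th2 = x[i].copy(), x[j].copy()
        for _ in range(n_iter):
            a = -0.5 * np.sum((x - th1) ** 2, axis=1)
            b = -0.5 * np.sum((x - th2) ** 2, axis=1)
            r1 = np.exp(a - np.logaddexp(a, b))  # responsibility of class 1
            r2 = 1.0 - r1
            th1 = (r1[:, None] * x).sum(axis=0) / r1.sum()
            th2 = (r2[:, None] * x).sum(axis=0) / r2.sum()
        ll = avg_log_likelihood(x, th1, th2)
        if best is None or ll > best[0]:
            best = (ll, th1, th2)
    return best[1], best[2]

def algorithm_m(x_unlabeled, x_labeled, y_labeled):
    """Returns a decision rule mapping points to labels 1 or 2."""
    th1, th2 = fit_means_em(x_unlabeled)
    normal = th2 - th1                 # step 2: hyperplane normal
    midpoint = 0.5 * (th1 + th2)       # step 2: assumed point on the hyperplane

    def side(x):
        return (np.atleast_2d(x) - midpoint) @ normal > 0.0

    def majority(mask, fallback):
        votes = y_labeled[mask]
        if votes.size == 0:            # no labeled example fell in this region
            return fallback
        return 1 if np.sum(votes == 1) >= np.sum(votes == 2) else 2

    on_th2_side = side(x_labeled)      # step 3: majority vote per region
    label_pos = majority(on_th2_side, 2)
    label_neg = majority(~on_th2_side, 1)
    return lambda x: np.where(side(x), label_pos, label_neg)
```

Note that the unlabeled examples alone fix the hyperplane; the labeled examples are used only to decide which of the two possible labelings of the half-spaces to adopt.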


We state the following theorem, then preview its proof, followed by the proof itself.


Theorem 4.2  Suppose the two pattern classes are distributed according to $N$-dimensional Gaussian probability densities $f_1(x|\theta_{01})$, $f_2(x|\theta_{02})$, both with unit covariance matrices and unknown means $\theta_{01}$, $\theta_{02}$, where $\theta_0 = [\theta_{01}, \theta_{02}] \in \Theta$ and $\Theta$ is a compact subset of $\mathbb{R}^{2N}$. Then for small $\epsilon > 0$ and arbitrary $\delta > 0$, given

N2  /  1  1\ 

„  =  Cl— (^Vbg^+lcg-) 


unlabeled examples and

$$m = c_2 \log\frac{1}{\delta}$$

labeled examples, algorithm M determines a decision rule with classification error

$$P_{error}(m, n) \le P_{Bayes}(1 + c_3\epsilon)$$

with confidence at least $1 - \delta$. In the above, $c_1, c_3 > 0$ are constants which depend on $\theta_0$, and $c_2 > 0$ depends on $P_{Bayes}$. All constants may be replaced (with a slight worsening of the bounds) by absolute positive constants.

4.2.3  Preview  of  the  proof  of  Theorem  4.2 

We now outline the proof (for more details see Section 4.3). The proof can be divided into three sections: first, it is shown that with $n$ as above, the maximum likelihood estimator $\hat\theta = [\hat\theta_1, \hat\theta_2]$ is $\epsilon$-close to $\theta_0$; consequently the hyperplane estimator is close to the Bayes hyperplane. Secondly, assuming that we picked the good labeling (as in Chapter 2, there are only two labelings), we determine the classification error of the resulting regions. This involves the same analysis as in Section 4.1. Thirdly, we determine a sufficient size for $m$ that guarantees with high confidence that the good labeling is picked by the majority rule on each of the two regions. As this is a random labeling method, it influences the confidence parameter $\delta$ of producing the overall classification rule. This labeling method yields an exponentially small (w.r.t. $m$) upper bound on the probability of choosing the bad labeling and therefore is superior.
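The exponential decay of the bad-labeling probability can be illustrated with a small calculation. This is a sketch under a simplifying assumption that is not taken from the text: each of the $k$ labeled examples falling in a region independently carries the minority ("wrong") label with a fixed probability $p < 1/2$. The exact binomial tail is compared with the Hoeffding bound $e^{-2k(1/2-p)^2}$; the function names are hypothetical.

```python
from math import comb, exp

def prob_bad_majority(k, p):
    """Exact probability that at least half of k independent labels are
    wrong when each is wrong with probability p (ties counted as bad)."""
    start = (k + 1) // 2  # smallest integer >= k/2
    return sum(comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(start, k + 1))

def hoeffding_bound(k, p):
    """Hoeffding upper bound on the same event, valid for p < 1/2."""
    return exp(-2.0 * k * (0.5 - p) ** 2)

if __name__ == "__main__":
    for k in (5, 10, 20, 40, 80):
        print(k, prob_bad_majority(k, 0.1), hoeffding_bound(k, 0.1))
```

Both quantities shrink geometrically in $k$, which is why a number of labeled examples of order $\log(1/\delta)$ suffices for confidence $1-\delta$ in the labeling step.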

The first section of the proof is more involved; it entails classical techniques in the field of probability and statistics. A preview of the first section of the proof is now provided.

It is known that Gaussian mixtures $f(x|\theta)$ are identifiable (see Section 4.3). Using this, the problem of estimating the true unknown distribution function $f(x|\theta_0)$ becomes one of estimating only the true parameter $\theta_0$, given that the learner has side information about the parametric form of $f(x|\theta)$. In the Gaussian mixture case, this is sufficient for estimating the optimal classifier since $\theta_0$ alone identifies the Bayes border.

Given $n$ random unlabeled examples drawn according to the true unknown mixture density $f(x|\theta_0)$, the aim is to show that any $\hat\theta$ that maximizes the function

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log f(x_i|\theta)$$

is $\epsilon$-close to the unknown $\theta_0$, i.e.

$$|\theta_0 - \hat\theta| < \epsilon$$

with high confidence provided $n$ is large enough. The approach, which is based on the original proof of Wald [8] (see also Rao [22]), is to show that $L(\theta) < L(\theta_0)$ for all $\theta$ in the parameter space $\Theta$ such that $|\theta - \theta_0| > \epsilon$, and simultaneously that there exists a $\theta_a$ with $|\theta_a - \theta_0| < \epsilon$ such that $L(\theta_a) > L(\theta_0)$. So by calculating $\arg\sup_{\Theta} L(\theta)$, the learner must obtain a $\hat\theta$ which is $\epsilon$-close to the true unknown parameter $\theta_0$.

The first step is realizing that the Kullback-Leibler distance between two densities $f$ and $g$, defined by

$$E_f \log\frac{g}{f},$$

achieves its maximum value of 0 uniquely when $f = g$. Then, because the Gaussian mixture is identifiable, there is uniqueness not only in the space of all Gaussian mixtures but also in the parameter space, i.e., there does not exist a $\theta$ differing from $\theta_0$ at which

$$\Phi_{\theta_0}(\theta) = E_{\theta_0}\log\frac{f(x|\theta)}{f(x|\theta_0)}$$

attains the value 0. (In the proof, we will drop the subscript $\theta_0$ and write only $\Phi(\theta)$, since $\theta_0$ is fixed throughout, and all expectations are w.r.t. $f(x|\theta_0)$.)

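If we can guarantee that the empirical Kullback-Leibler function

$$\Phi_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log\frac{f(x_i|\theta)}{f(x_i|\theta_0)} \qquad (4.1)$$

which equals

$$L(\theta) - L(\theta_0)$$

is close enough to $\Phi(\theta)$ uniformly for all $\theta \in \Theta$, then it is not difficult to see that $\Phi_n(\theta)$ can be made $< 0$ for all $\theta$ s.t. $|\theta - \theta_0| > \epsilon$ for arbitrarily small $\epsilon > 0$. This in turn implies that $L(\theta) < L(\theta_0)$ for all $\theta$ such that $|\theta - \theta_0| > \epsilon$. It is also necessary to show that there exists at least one point $\theta_a$ with $L(\theta_a) > L(\theta_0)$ and $|\theta_a - \theta_0| < \epsilon$. Then it follows that $\arg\sup_{\Theta} L(\theta)$ must be $\epsilon$-close to $\theta_0$, which is the needed result.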
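The empirical Kullback-Leibler function is easy to evaluate on simulated data. The sketch below is illustrative only: it assumes the equal-weight, unit-covariance mixture of this section, and the helper names (`log_mixture`, `avg_loglik`, `phi_n`) are hypothetical. It shows that $\Phi_n$ is exactly the difference of average log-likelihoods, vanishes at $\theta_0$ and at its permutation, and is negative at a parameter far from $\theta_0$.

```python
import numpy as np

def log_mixture(x, theta):
    """log f(x | theta) for f = 0.5*N(th1, I) + 0.5*N(th2, I), theta = (th1, th2)."""
    th1, th2 = theta
    dim = x.shape[1]
    a = -0.5 * np.sum((x - th1) ** 2, axis=1)
    b = -0.5 * np.sum((x - th2) ** 2, axis=1)
    return np.logaddexp(a, b) - np.log(2.0) - 0.5 * dim * np.log(2.0 * np.pi)

def avg_loglik(x, theta):
    """L(theta) = (1/n) * sum_i log f(x_i | theta)."""
    return float(np.mean(log_mixture(x, theta)))

def phi_n(x, theta, theta0):
    """Empirical Kullback-Leibler function: (1/n) * sum_i log f(x_i|theta)/f(x_i|theta0)."""
    return float(np.mean(log_mixture(x, theta) - log_mixture(x, theta0)))

def sample_mixture(n, theta0, seed=0):
    """Draw n unlabeled examples from the mixture with parameter theta0."""
    rng = np.random.default_rng(seed)
    th1, th2 = theta0
    pick = rng.integers(0, 2, size=n)[:, None]
    return rng.standard_normal((n, th1.size)) + np.where(pick == 0, th1, th2)
```

For example, with `theta0 = (np.zeros(2), np.array([3.0, 0.0]))` and `x = sample_mixture(5000, theta0)`, the value `phi_n(x, theta0, theta0)` is zero, while a clearly wrong parameter gives a negative value.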

These two demands are satisfied with the help of the uniform SLLN (Theorem 3.9). It allows us to guarantee that

$$|\Phi_n(\theta) - \Phi(\theta)| < B_{\theta_0}\epsilon^2,$$

for $\theta$ s.t. $|\theta - \theta_0| = \epsilon$, where the constant $B_{\theta_0} > 0$ is s.t. $\Phi(\theta) < -B_{\theta_0}\epsilon^2$ for such $\theta$ and small $\epsilon > 0$. For all such $\theta$ we therefore have $L(\theta) < L(\theta_0)$. Using the continuity of $L(\theta)$ this implies that there exists $\theta_a$, $\epsilon$-close to $\theta_0$, which is a maximum (not necessarily the global maximum) of $L(\theta)$. We then again use the same principle to show that

$$|\Phi_n(\theta) - \Phi(\theta)| < \alpha(\epsilon) \qquad (4.2)$$

which makes $L(\theta) < L(\theta_0)$ for all $\theta$ s.t. $|\theta - \theta_0| > \epsilon$, where the deviation $\alpha(\epsilon)$ is within our control.

We can only use the uniform SLLN over a class of uniformly bounded functions (over $x \in \mathbb{R}^N$). The difficulty here is that the function $\log f(x|\theta)$ is unbounded over $x \in \mathbb{R}^N$. This difficulty is finessed by use of a truncation argument restricting the class of functions to which we apply the uniform SLLN to be a class

$$\{g(x|\theta) = \log f(x|\theta)\,1_D(x) : \theta \in \Theta\}$$

where $D$ is a properly selected set in $\mathbb{R}^N$. This is a class of bounded functions so we can get uniformly small deviations between the empirical and true means of such functions with high probability. To get such deviations over the complement, $D^c$, which is not compact, we must properly select $D$ such that the tail of the Gaussian (i.e., the underlying probability measure) decreases fast enough over $D^c$ so that the expectation of $\log f(x|\theta)\,1_{D^c}(x)$ is negligible.

It is crucial to find the deviation needed to have $\Phi_n(\theta) < 0$ for all $\theta$ s.t. $|\theta - \theta_0| > \epsilon$ because, from Theorem 3.9, it is clear that this deviation has a direct effect on the sample size $n$, since it appears in the expression for $n$. If $\alpha(\epsilon)$ decreases exponentially fast as $N$ increases or as $\epsilon$ decreases to 0 then the number of unlabeled examples will increase exponentially fast.

The last part of this analysis shows that the bound $\alpha(\epsilon)$ in (4.2) may be selected $O(\epsilon^2)$ independent of $N$. Thus the sample size $n$ stays polynomial in $N$ and in $1/\epsilon$. The major part here is based on a technique that takes advantage of the low dimensional symmetry of the $N$-dimensional integrals that constitute the function $\Phi(\theta)$. Once such symmetry is identified, it follows that the set of values that $\Phi(\theta)$ takes for $\theta \in \Theta \subset \mathbb{R}^{2N}$ remains the same for all $N > 3$. This establishes a sufficient deviation $\alpha(\epsilon)$ which is constant with $N$ for $N > 3$.

4.3  Proof  of  Theorem  4.2 

In the following, all expectations are taken w.r.t. $f(x|\theta_0)$. Initially we use the uniform SLLN to show that with $n$ (as above) unlabeled examples it is possible to estimate the true parameter $\theta_0$ by $\hat\theta$ to within a small deviation. This implies that the decision rule is close to Bayes. Finally, we calculate the $m$ that guarantees with high confidence that the decision regions are labeled correctly.

The parameter space $\Theta \subset \mathbb{R}^{2N}$ since in our case $\theta$ consists of the two $N$-dimensional mean-vectors. Denote by $\theta_0$ the unknown parameter which determines the optimal decision border. The likelihood function is defined as

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log f(x_i|\theta)$$

where $x_i$ are the unlabeled examples. The learner calculates the value of $\theta$ which achieves the global maximum of $L(\theta)$; call it $\hat\theta$. This $\hat\theta$ is then used for determining the decision rule as described above. Our aim is to show that $\hat\theta$ is $\epsilon$-close to $\theta_0$. First we find how large $n$ suffices to guarantee that there exists a maximum (possibly a relative maximum) of $L(\theta)$ inside the closed $\epsilon$-ball at $\theta_0$ (denoted by $B(\theta_0,\epsilon)$), i.e., that there exists some $\theta_a \in B(\theta_0,\epsilon)$ such that $L(\theta_a) > L(\theta_0)$. Then we show that for all $\theta$ outside this ball, $L(\theta) < L(\theta_0)$. This will imply that by picking the global maximum of $L(\theta)$, the learner chooses a $\hat\theta$ which is $\epsilon$-close to $\theta_0$. Theorem 2 and Proposition 1 of Teicher [29] together with Proposition 2 of Yakowitz [30] imply that mixtures of $N$-dimensional Gaussians are identifiable, hence there can be only one true unknown parameter $\theta_0$ (we disregard the vector $[\theta_{02}, \theta_{01}]$ which differs only in the permutation since the decision border is the same in this case).

Because $L(\cdot)$ is continuous in $\theta$, to have a maximum inside $B(\theta_0,\epsilon)$ it is sufficient to have $L(\theta_\epsilon) < L(\theta_0)$ where $\theta_\epsilon \in \partial B(\theta_0,\epsilon)$, the surface of the ball of radius $\epsilon$ at $\theta_0$. We also use $\epsilon$ to denote any $2N$-dimensional vector of magnitude $\epsilon$. Hence we need $\frac{1}{n}\sum_{i=1}^{n}\log\frac{f(x_i|\theta_\epsilon)}{f(x_i|\theta_0)} < 0$. For any two different distributions $g(x)$ and $h(x)$, it is easy to show that $\int g(x)\log\frac{h(x)}{g(x)}\,dx < 0$. Hence $E\log\frac{f(x|\theta_\epsilon)}{f(x|\theta_0)} < 0$. (This is provided that both $E\log f(x|\theta_0)$ and $E\log f(x|\theta)$ exist, which is true in our case as is shown in Lemma 4.3.)
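For completeness, the inequality for two densities $g$ and $h$ can be obtained from Jensen's inequality applied to the concave function $\log$. The following is the standard argument, written out as a sketch; it assumes the integrals exist.

```latex
\int g(x)\,\log\frac{h(x)}{g(x)}\,dx
  \;\le\; \log \int_{\{g>0\}} g(x)\,\frac{h(x)}{g(x)}\,dx
  \;=\; \log \int_{\{g>0\}} h(x)\,dx
  \;\le\; \log 1 \;=\; 0 .
```

Equality in the first step requires $h/g$ to be constant on the support of $g$, which together with equality in the last step forces $h = g$ almost everywhere; for two different distributions the inequality is therefore strict.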


Lemma 4.3  For a Gaussian mixture $f(x|\theta)$ with unit covariances and a priori probabilities $\tfrac{1}{2}$,

$$\left| E\log\frac{f(x|\theta)}{f(x|\theta_0)} \right| < \infty$$

for any fixed $\theta_0$ and $\theta$ in $\mathbb{R}^{2N}$.


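Proof:  It suffices to show that $E_{\theta_0}\log f(x|\theta)$ is finite for any fixed $\theta$ and $\theta_0$ in $\mathbb{R}^{2N}$. We have

$$f(x|\theta) = \tfrac{1}{2} f_1(x|\theta_1) + \tfrac{1}{2} f_2(x|\theta_2) = \tfrac{1}{2}(2\pi)^{-N/2}\left(e^{-\frac{1}{2}|x-\theta_1|^2} + e^{-\frac{1}{2}|x-\theta_2|^2}\right) \le (2\pi)^{-N/2} \le 1$$

for all $x \in \mathbb{R}^N$. Consequently,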


|l°g  f{x\9)\  < 


iog^/iO^i) 


<  log  2  + 


x  -  9A 


It  follows  that 


Edo\ogf(x\0)  =  J  f(x\90)  log  f(x\9)dx 

<  J  f(x\6o)\\ogf(x\0)\  dx 

<  J  f(x\90)  ^log  2  +  -  ^  ^  dx 
=  log 2  +  \j\x~  Oil2 fi(x\doi)  dx  +  \f\x~  02\2 h{x\e02)  dx
<  oo 

as the two integrals on the right hand side just yield finite combinations of various second moments of the Gaussian. ∎


As mentioned, the class of $N$-dimensional Gaussian mixtures is identifiable up to permutation of the two parameters of the marginals. This means that if $f(x|\theta) = f(x|\theta_0)$ then either $\theta = [\theta_{01},\theta_{02}]$ or $\theta = [\theta_{02},\theta_{01}]$. Now, $E\log\frac{f(x|\theta)}{f(x|\theta_0)}$ equals 0 if and only if $f(x|\theta) = f(x|\theta_0)$ (cf. Cover & Thomas [31]). Hence it follows that if $\theta \ne [\theta_{01},\theta_{02}]$ and $\theta \ne [\theta_{02},\theta_{01}]$ then $E\log\frac{f(x|\theta)}{f(x|\theta_0)} < 0$. So our following argument will prove that there exists a maximum of $L(\theta)$ either $\epsilon$-close to $\theta_0$, i.e., $[\theta_{01},\theta_{02}]$, or to the vector $[\theta_{02},\theta_{01}]$. For clarity we will only mention conditions which are sufficient for the existence of a relative maximum, and later of a global maximum, $\epsilon$-close to $\theta_0$; this will be apparent from the fact that the uniform SLLN is applied only over the ball $\{\theta \in B(\theta_0,\epsilon)\}$. However, strictly speaking we should use the uniform SLLN over the region $\{\theta \in B(\theta_0,\epsilon) \cup B([\theta_{02},\theta_{01}],\epsilon)\}$, which will yield the existence of a relative maximum of $L(\theta)$ either $\epsilon$-close to $[\theta_{01},\theta_{02}]$ or to the vector $[\theta_{02},\theta_{01}]$, and similarly for the proof for the global maximum of $L(\theta)$. It turns out that the sample complexities are practically unaffected by this notational nuisance.

4.3.1  Local  Maximum  of  the  Likelihood  Function 


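We would like to have with large enough $n$, and with high confidence, that

$$\sup_{\theta\in B(\theta_0,\epsilon)} \left| \frac{1}{n}\sum_{i=1}^{n}\log f(x_i|\theta) - E\log f(x|\theta) \right| < \alpha/2. \qquad (4.3)$$

Further ahead, we utilize the uniform SLLN (Theorem 3.9) to achieve this once we define an appropriate function class which satisfies the boundedness condition of the theorem. Once (4.3) is true, it implies that

$$\sup_{\theta_\epsilon} \left| \frac{1}{n}\sum_{i=1}^{n}\log f(x_i|\theta_\epsilon) - E\log f(x|\theta_\epsilon) \right| < \alpha/2$$

and that

$$\left| \frac{1}{n}\sum_{i=1}^{n}\log f(x_i|\theta_0) - E\log f(x|\theta_0) \right| < \alpha/2$$

since it is the special case when $\epsilon = 0$. These imply

$$\sup_{\theta_\epsilon} \left| \frac{1}{n}\sum_{i=1}^{n}\log\frac{f(x_i|\theta_\epsilon)}{f(x_i|\theta_0)} - E\log\frac{f(x|\theta_\epsilon)}{f(x|\theta_0)} \right| < \alpha.$$

Hence to get $\frac{1}{n}\sum_{i=1}^{n}\log\frac{f(x_i|\theta_\epsilon)}{f(x_i|\theta_0)} < 0$ it is sufficient to have (4.3) true and choose

$$\alpha(\epsilon) = \inf_{\theta_\epsilon} \left| E\log\frac{f(x|\theta_\epsilon)}{f(x|\theta_0)} \right|.$$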

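We need to estimate the dependence of $\alpha(\epsilon)$ on $\epsilon$ and show that $\alpha(\epsilon)$ is not arbitrarily close to 0 for a fixed $\epsilon > 0$. For small $\epsilon > 0$ we expand $E\log\frac{f(x|\theta_\epsilon)}{f(x|\theta_0)}$ in a $2N$-dimensional Taylor series around $\theta_0$ as follows:

$$E\log\frac{f(x|\theta_\epsilon)}{f(x|\theta_0)} = \int f(x|\theta_0)\log f(x|\theta_\epsilon)\,dx - \int f(x|\theta_0)\log f(x|\theta_0)\,dx.$$

The first term becomes

$$\int f(x|\theta_0)\log f(x|\theta_\epsilon)\,dx = \int f(x|\theta_0)\log f(x|\theta_0)\,dx + \sum_{i=1}^{2N}\epsilon_i\int f(x|\theta_0)\frac{\partial}{\partial\theta_{0i}}\log f(x|\theta_0)\,dx$$
$$\quad + \frac{1}{2}\sum_{i,j=1}^{2N}\epsilon_i\epsilon_j\int f(x|\theta_0)\frac{\partial^2}{\partial\theta_{0i}\partial\theta_{0j}}\log f(x|\theta_0)\,dx$$
$$\quad + \frac{1}{6}\sum_{i,j,k=1}^{2N}\epsilon_i\epsilon_j\epsilon_k\int f(x|\theta_0)\frac{\partial^3}{\partial\theta_{0i}\partial\theta_{0j}\partial\theta_{0k}}\log f(x|\theta_0)\Big|_{\theta_{\epsilon'}}\,dx \qquad (4.4)$$

where $\epsilon'$ is on the line between 0 and $\epsilon$. Lemma 4.4 shows that the integral of the third order partials is bounded by some constant, making the term bounded above by $c_0\epsilon^3$ for some positive constant $c_0$.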




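Lemma 4.4

$$\sum_{i,j,k=1}^{2N}\epsilon_i\epsilon_j\epsilon_k\int f(x|\theta_0)\frac{\partial^3}{\partial\theta_{0i}\partial\theta_{0j}\partial\theta_{0k}}\log f(x|\theta_0)\Big|_{\theta_{\epsilon'}}\,dx \le c\,\epsilon^3 \qquad (4.5)$$

for some constant $c > 0$.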


Proof:  First  work  on  the  integral;  denote  ^:f{x\^o)\ec,  =  fi ■  Then  from  page  61 
we  have 

d2  ,  f(  |fl  x ,  ffij-fifj 
logf(x\0o)\e  = 


d0Oid9oj  aJy  '  P 

and  hence  (denote  /o  =  f{x\0o)  and  /  =  f(x\0ei) 

d 3 


dOo,  d&o j  dOok 
ffij  ~  fif; 


log/(x|0o)|ee,  dx 


J  f(x\e o) 

-  P\Lkjrdi)tix  =  lh 
=  /*(*& 


flMi  +  /A,t  -  Mi  ~  tiM  -  VMfh  -  Ml 


/2  j  /* 

hfii  ±  ffijk  -  fikfj  -  fifjk  _  Jkffa  -  hfifA  j_ 

P  PJ 


Let  /  denote  any  one  of  the  two  class  conditionals,  i.e.  f(x\0t'i)  or  f(x\9e>2)  and 
polyT(x )  denote  any  polynomial  in  xif ..  .,xjv  of  degree  <  r.  Then  /,•  =  / polyi(x), 
fij  —  f  poly2(x),  and  fijk  =  f  poly3(x).  Hence  the  above  is  bounded  by 


P\p°hi'i{x) 

P 


.  / 1  7  /  \ i  ,  f2  M?/3(.t)|  , 

+  y  |poZy3(®)|  + - j2 -  + 


+ 


/2  \p°hj3p)\ 

p 

f2  \poly3{x)\  /3  |po/y3(g)h 


Recall  that  /  =  /(.'r|0e<i)/2  +  /(:r|0(:'2)/2  hence  1//  <  2//  and  so  the  above  is  bounded 
by  12  /  /0  |poh/3(x)|  d.r.  This  is  composed  of  a  finite  sum  of  products  of  terms  such  as 


— -L=  j  e  i1'  |.Xj|r  dx-i  for  0<  r  <3,  l<i<N 

v27T  J-oo 


dx 
It is easy to show that such terms are finite. For instance, we can bound the 3rd order term as follows (we use $y$ to denote any one-dimensional component):


— 7=  [  e  2 3/2 lyl3  dy  =  — ==  /  e  *y  \y\3  dy -\ — 7=  f  e  *y  \y\5  dy. 

sfaLoo  m  y  yfc J\y\<A  y  y/2^  J\y\>A  m 

The  first  term  is  clearly  finite  since  the  integrand  is  continuous  and  over  a  closed  set. 
The  second  term  is  bounded  as  follows: 

—7=  /  e— 2 3/2 1 j/ 13  dy  <  —7=  f  e~v2/2c  dy  z=  f  e~z  dz 

y/2w  J\y\>A  \/2ir  J\y\>A  y2x  Jz2>A2/c 

where  using  the  fact  that  V?/  >  A  >  e1/2,  log  yjy2  <  log  A/ A2  hence  e~y  !2yz  <  e~y  /2c 
for  c  >  1/(1  —  67^);  therefore  it  suffices  to  let  A  >  e1/2  and  choose  c  accordingly. 
Finally,  the  last  integral  can  be  bounded  by  2 Nc2/A4  using  the  variance  of  a  chi- 
square  and  Chebyshev’s  inequality.  So  we  have  shown  the  integral  in  (4.5)  is  finite. 
Call  it  ctijk-  From  (4.5)  we  have 


Y  eie3ekdijk  <  Y  lc*c;l  Y  \ekank\  <  Y  \eiei\JY€lJYahk  = e Y  M  Y  Mol 

ij,k  i,j  k  ij  V  k  y  k  i  j 

<  e2  Y,  letc;|  <  ce3 


for some constant $c > 0$. ∎


For the other terms note that

$$f(x|\theta_0)\frac{\partial}{\partial\theta_{0i}}\log f(x|\theta_0) = \frac{\partial}{\partial\theta_{0i}} f(x|\theta_0).$$

Then the derivative can be represented as a limit of a bounded sequence (since $f(x|\theta_0)$ is differentiable), hence by the Lebesgue bounded convergence theorem we can take the limit, and hence the derivative, outside the integral. Doing this, the term with the first order derivatives becomes zero, since $\int f(x|\theta_0)\,dx = 1$ for every $\theta_0$ and the derivative of a constant vanishes. Now consider the terms corresponding to the second order derivatives. We have:


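$$\int f(x|\theta_0)\frac{\partial^2}{\partial\theta_{0i}\partial\theta_{0j}}\log f(x|\theta_0)\,dx = \int f(x|\theta_0)\frac{\partial}{\partial\theta_{0i}}\left(\frac{\frac{\partial}{\partial\theta_{0j}}f(x|\theta_0)}{f(x|\theta_0)}\right)dx$$
$$= \int f(x|\theta_0)\,\frac{f(x|\theta_0)\frac{\partial^2}{\partial\theta_{0i}\partial\theta_{0j}}f(x|\theta_0) - \frac{\partial}{\partial\theta_{0i}}f(x|\theta_0)\frac{\partial}{\partial\theta_{0j}}f(x|\theta_0)}{f^2(x|\theta_0)}\,dx$$
$$= -\int f(x|\theta_0)\,\frac{\frac{\partial}{\partial\theta_{0i}}f(x|\theta_0)\frac{\partial}{\partial\theta_{0j}}f(x|\theta_0)}{f^2(x|\theta_0)}\,dx$$
$$= -E\left(\frac{\partial}{\partial\theta_{0i}}\log f(x|\theta_0)\,\frac{\partial}{\partial\theta_{0j}}\log f(x|\theta_0)\right) = -I_{ij}(\theta_0)$$

where $I_{ij}(\theta_0)$ is the $ij$th element in the Fisher information matrix evaluated at $\theta_0$. Thus the above imply

$$E\log\frac{f(x|\theta_\epsilon)}{f(x|\theta_0)} = -\frac{1}{2}\epsilon^T I(\theta_0)\,\epsilon + \frac{1}{6}\sum_{i,j,k=1}^{2N}\epsilon_i\epsilon_j\epsilon_k\int f(x|\theta_0)\frac{\partial^3}{\partial\theta_{0i}\partial\theta_{0j}\partial\theta_{0k}}\log f(x|\theta_0)\Big|_{\theta_{\epsilon'}}\,dx.$$

The first term on the right is $-\frac{1}{2}B_{\theta_0}\epsilon^2$ where $B_{\theta_0}$ is a constant depending on $\theta_0$ which is positive if $\theta_{01} \ne \theta_{02}$ since for such a $\theta_0$, $I(\theta_0)$ is positive definite, as is shown next.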


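Lemma 4.5  Given a mixture $f(x|\theta_0)$ of two Gaussians $f_1(x|\theta_{01})$ and $f_2(x|\theta_{02})$ with means $\theta_{01} \ne \theta_{02}$ and with unit covariance matrices, the Fisher information matrix $I(\theta_0)$ whose $ij$th element is

$$E\left(\frac{\partial}{\partial\theta_{0i}}\log f(x|\theta_0)\,\frac{\partial}{\partial\theta_{0j}}\log f(x|\theta_0)\right)$$

is positive definite.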


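Proof:

For any vector $u \ne 0$

$$u^T I u = \sum_{i,j} u_i u_j \int f(x|\theta_0)\frac{\partial}{\partial\theta_{0i}}\log f(x|\theta_0)\frac{\partial}{\partial\theta_{0j}}\log f(x|\theta_0)\,dx = \int f(x|\theta_0)\left(\sum_i u_i\frac{\partial}{\partial\theta_{0i}}\log f(x|\theta_0)\right)^2 dx > 0$$

which follows since

$$\left(\sum_i u_i\frac{\partial}{\partial\theta_{0i}}\log f(x|\theta_0)\right)^2$$

is not identically zero over the probability 1 support of $f(x|\theta_0)$, as we now show. The functions in the set

$$\left\{\frac{\partial}{\partial\theta_{0i}}\log f(x|\theta_0) : 1 \le i \le 2N\right\}$$

are linearly independent when $\theta_{01} \ne \theta_{02}$ since for any $u \ne 0$ we question whether there is a solution to

$$\sum_{i=1}^{2N} u_i\frac{\partial}{\partial\theta_{0i}}\log f(x|\theta_0) = \frac{1}{2}\sum_{i=1}^{N} u_i(x_i - \theta_{0i})\frac{f_1}{f} + \frac{1}{2}\sum_{i=N+1}^{2N} u_i(x_{i-N} - \theta_{0i})\frac{f_2}{f} = 0.$$

This is the same as asking if any function in the set

$$\left\{x_1 - \theta_{01},\ x_2 - \theta_{02},\ \ldots,\ x_N - \theta_{0N},\ \frac{f_2}{f_1}(x_1 - \theta_{0,N+1}),\ \frac{f_2}{f_1}(x_2 - \theta_{0,N+2}),\ \ldots,\ \frac{f_2}{f_1}(x_N - \theta_{0,2N})\right\}$$

is a linear combination of the others. By inspection, as long as $\theta_{01} \ne \theta_{02}$, this is not possible because $f_2/f_1$ is a nonlinear function of the $x_i - \theta_{0i}$.

So for $\theta_0 = [y, z]$ where $y \ne z$ and $y, z \in \mathbb{R}^N$, we have shown that the Fisher matrix $I(\theta_0)$ is positive definite. ∎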
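Lemma 4.5 can be checked numerically for a particular $\theta_0$. The sketch below is an illustration, not part of the proof: it estimates the Fisher information matrix by Monte Carlo from the score vector, whose two blocks are $r_1(x)(x-\theta_{01})$ and $r_2(x)(x-\theta_{02})$ with $r_k = \tfrac{1}{2}f_k/f$, and inspects its smallest eigenvalue. The function names and the sample size are assumptions made for the example.

```python
import numpy as np

def scores(x, th1, th2):
    """Gradient of log f(x | theta0) w.r.t. the 2N mean coordinates, one row per example."""
    a = -0.5 * np.sum((x - th1) ** 2, axis=1)
    b = -0.5 * np.sum((x - th2) ** 2, axis=1)
    r1 = np.exp(a - np.logaddexp(a, b))  # (1/2) f_1 / f
    r2 = 1.0 - r1                        # (1/2) f_2 / f
    return np.hstack([r1[:, None] * (x - th1), r2[:, None] * (x - th2)])

def fisher_information_mc(th1, th2, n=20000, seed=0):
    """Monte Carlo estimate of I(theta0) = E[score score^T] under f(x | theta0)."""
    rng = np.random.default_rng(seed)
    pick = rng.integers(0, 2, size=n)[:, None]
    x = rng.standard_normal((n, th1.size)) + np.where(pick == 0, th1, th2)
    s = scores(x, th1, th2)
    return s.T @ s / n

if __name__ == "__main__":
    info = fisher_information_mc(np.zeros(2), np.array([3.0, 0.0]))
    print(np.linalg.eigvalsh(info))  # all eigenvalues positive when the means differ
```

When the two means coincide the two blocks of the score are identical, so the estimated matrix is singular; this matches the requirement $\theta_{01} \ne \theta_{02}$ in the lemma.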


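Now from Rayleigh's quotient we have $\frac{\epsilon^T I \epsilon}{|\epsilon|^2} \ge \lambda_{min} > 0$ because all eigenvalues of a positive definite matrix are positive. This implies that $\frac{1}{2}\epsilon^T I \epsilon = O(\epsilon^2)$ and so for any $\epsilon$

$$E\log\frac{f(x|\theta_\epsilon)}{f(x|\theta_0)} \le -\frac{1}{2}\lambda_{min}\epsilon^2 + c_0\epsilon^3$$

therefore

$$\sup_{\theta_\epsilon} E\log\frac{f(x|\theta_\epsilon)}{f(x|\theta_0)} = -\frac{1}{2}B_{\theta_0}\epsilon^2 + c_0\epsilon^3 < 0$$

for small enough $\epsilon$, and $B_{\theta_0} > 0$.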
(The case of $\theta_{01} = \theta_{02}$ is trivial as this results in $P_{error} = \tfrac{1}{2}$. Without loss of generality we henceforth assume $\theta_{01} \ne \theta_{02}$ whence $B_{\theta_0} > 0$ strictly.)

We did not consider the dependence of the right hand side on $N$ but instead only as a function of $\epsilon$ and $\theta_0$ since the left side is invariant for $N > 3$; we show later that this is true more generally. Hence we have

$$\alpha(\epsilon) = \inf_{\theta_\epsilon}\left| E\log\frac{f(x|\theta_\epsilon)}{f(x|\theta_0)} \right| = B_{\theta_0}\epsilon^2 + c_0\epsilon^3 \qquad (4.6)$$

and with this selection of $\alpha = \alpha(\epsilon)$, (4.3) will force $\frac{1}{n}\sum_{i=1}^{n}\log\frac{f(x_i|\theta_\epsilon)}{f(x_i|\theta_0)} < 0$. (Henceforth we rename $B_{\theta_0}$ as $c_1$.)

We now estimate the unlabeled sample size $n$ needed to guarantee (4.3). The uniform SLLN holds for a class of functions that are uniformly bounded (see Theorem 3.9). Hence define a function class $Q$ as follows: let $D \subset \mathbb{R}^N$ be a compact subset of the probability one support of $f(x)$ and denote its complement as $D^c$. Let

$$Q = \{\log f(x|\theta)\,1_D(x) : \theta \in B(\theta_0,\epsilon)\}.$$

The functions in $Q$ are bounded hence we can use the uniform SLLN over it. Then


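$$P\left(\sup_{\theta\in B(\theta_0,\epsilon)}\left| E\log f(x|\theta) - \frac{1}{n}\sum_{i=1}^{n}\log f(x_i|\theta)\right| > \alpha/2\right)$$
$$= P\Bigg(\sup_{\theta\in B(\theta_0,\epsilon)}\Bigg| \int_{D} f(x|\theta_0)\log f(x|\theta)\,dx - \frac{1}{n}\sum_{i=1}^{n}\log f(x_i|\theta)\,1_D(x_i)$$
$$\qquad\qquad + \int_{D^c} f(x|\theta_0)\log f(x|\theta)\,dx - \frac{1}{n}\sum_{i=1}^{n}\log f(x_i|\theta)\,1_{D^c}(x_i)\Bigg| > \alpha/2\Bigg).$$

Using Boole's inequality we upper bound this by

$$P\left(\sup_{\theta\in B(\theta_0,\epsilon)}\left| \int_{D} f(x|\theta_0)\log f(x|\theta)\,dx - \frac{1}{n}\sum_{i=1}^{n}\log f(x_i|\theta)\,1_D(x_i)\right| > \alpha/4\right)$$
$$+\; P\left(\sup_{\theta\in B(\theta_0,\epsilon)}\left| \int_{D^c} f(x|\theta_0)\log f(x|\theta)\,dx - \frac{1}{n}\sum_{i=1}^{n}\log f(x_i|\theta)\,1_{D^c}(x_i)\right| > \alpha/4\right). \qquad (4.7)$$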
The first term of the sum is the probability of uniform convergence for functions in $Q$. Hence Theorem 3.9 can be used here to make the first term arbitrarily small. The theorem does not directly apply for the second term since $\log f(x|\theta)$ is unbounded over $D^c$. However this term can be made arbitrarily small as we show next. Our aim is to choose the region $D$ such that both the empirical and the true means over $D^c$ are small (so their absolute difference must be small); that is, we do not need the SLLN to guarantee their difference is small. We achieve that by utilizing the rapid decay of the Gaussian outside a large enough sphere centered at the mean. We bound the second term of (4.7) by

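$$P\left(\sup_{\theta\in B(\theta_0,\epsilon)}\left|\int_{D^c} f(x|\theta_0)\log f(x|\theta)\,dx\right| + \sup_{\theta\in B(\theta_0,\epsilon)}\left|\frac{1}{n}\sum_{i=1}^{n}\log f(x_i|\theta)\,1_{D^c}(x_i)\right| > \alpha/4\right)$$

and

$$\sup_{\theta\in B(\theta_0,\epsilon)}\left|\int_{D^c} f(x|\theta_0)\log f(x|\theta)\,dx\right| \le \int_{D^c} f(x|\theta_0)\sup_{\theta\in B(\theta_0,\epsilon)}|\log f(x|\theta)|\,dx.$$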

Without loss of generality suppose $\theta_{01} = 0$ (rotational symmetry allows us to translate the coordinate system). We choose

$$D = \{x : |x| < A \ \text{or}\ |x - \theta_{02}| < A\}$$

for some constant $A$. We then have


f(x\90i)  sup  log-/i(*|0i)  dx 


9£B(9  0,e) 


I  f(x |0O)  sup  |log/(x|0)|  dx  =  /  j 

Jx£Dc  9eB(90,t)  J|x|>4n|:r-0o2|>A 

<  \  [  f{x\9oi)  sup  log^/ii 

l  7|*|>^n|a:-floal>^  9eB{90,e)  ^ 

+  \  f  f(x\0m)  sup  log  i/2 

1  J\x\>Ar\\x-90i\>A  9eB(90,e)  ^ 

<  \  !  f(x |0oi)  sup  !°g  \h(x\0i)  (lx 

4  J\x\>A  9£B(90le)  ^ 

+  \  f  /(*  1^02)  sup  |log/2(x|02) 

l  J\x-9ai\>A  6PB(9«.t)  ^ 


'  f(x\90)  sup  |log/(x|0)|  dx 

M>v4n|r-0O2|>A  9eB(9o,c) 


+  -  f{x \902)  sup  log-/2(x|02)  dx 

l  J|x|>An|r-fl02|>A  9eB(90,e)  £ 


9£B(0q  ,c) 


I\x—6o2\'>A 


f(x\0 02)  sup  -log/2(x |02)  dx 


9eB(9o,t)\6 
\  I  ,  f(x\90l)  logi/<r(x)  dx

l  J\x\>A  l 

+  \[  f{x\0o2)\og]-fa(x-6o2) 

.  £  J\x  —  9q2 \>A  £ 


\LAi2^^l2\H\u{x)\dx 


i\x\>a  (2tt)jv/2 

l‘°«  5«*  -  M  *• 


where the normal density

$$f_\sigma(x) = \frac{1}{(2\pi\sigma^2)^{N/2}}\,e^{-|x|^2/2\sigma^2}$$

is defined (by choosing an appropriate $\sigma$) such that

$$|\log f_1(x|\theta_1)| \le |\log f_\sigma(x)|$$

for $\theta \in B(\theta_0,\epsilon)$ and $\{x : |x| > A\}$. (Recall that $\theta_1$ denotes the first $N$ components of $\theta$.) For this it is sufficient to choose $\sigma$ that satisfies


(2jr)*/: 


.g{  — (*!+£)  -*2— — *n}/21i_[j4oi...i0]  = 


1  |r|2/2<T2  | 

(2t ro-2)^/2  'x=[Afl °]' 


Later, we discuss more specifically the choice of $\sigma$. The above is bounded by

La 2^e'W,,2(l0S2  +  f  l0S2^+2^|x|2)  *  (4-8) 

The  terms  of  (4.8)  can  be  bounded  by  the  tail  of  a  chi-square  distribution  (see  Rao 
[22]).  We  outline  how  this  is  done.  We  use  the  fact  that 


e~z /2  ^  <  - e- 

-  PN/ 2 


for  z  >9N  and  for  all  N  >  1  (since  \og9N/N  <  log 9/9  for  all  N  >  1).  So  we  have 


(2tt)^/2  J\x p>A2 


[  e~lx]2/2\x\\lx 

J\x\2>A2 


(2K)Nl2eNI2  J{x?>A 


/  e~M2/2e  dx  =  1  f  e-h|2/2  dx 

J\x\2>A 2  (27r)Ar/2  J \x\2>-& 
for $A^2 \ge 9N$ and $N \ge 1$. The last integral is equivalent to the probability that the sum of squares of $N$ standard normal random variables, which is distributed as a chi-square with variance $2N$, is $\ge A^2/e$. Hence using Chebyshev's inequality we can bound this integral by $2Ne^2/A^4$. The remaining terms of (4.8) are simple to bound by similar arguments. Finally, let us return to the choice of $\sigma$. The condition $A^2 \ge 9N$ together with the condition on $\sigma$ (in the construction of $f_\sigma(x)$) yields:
e-(A+c)*/2  -A*/ 2A  M  +  e)2  A2 

Letting  y  —  1  j  a2  we  have  |log  y  —  log(l /a).  Solving  for  y  by  bootstrapping  yields 


$$1/\sigma^2 \ge 2\left(1 + b_0(\epsilon/\sqrt{N})\right) = b_1$$

(where $b_0$, $b_1$ are constants). Thus we can take $1/\sigma^2 = b_1$, i.e., treat it as a constant w.r.t. $N$ in (4.8). So we get the bound

sup  [  f(x\90)  log  f(x\9 )  dx 

=  b (a*. A  JxeDc 


<  c^f/V  +  c,) 


9£B(e o,e)  \Jx&Dc 

(where  c2,  C3  are  positive  constants)  which  is  independent  of  90.  Denote  this  by  A. 
We  continue  to  bound  the  second  term  of  (4.7).  We  have 


P  (  sup 


[  f(x\9o)  log  f(x\9)  dx 

JxeDc 


+  sup 

9£B(9  0  ,e) 


U  i=l 


> 


a/4) 


<  P  I  A  +  sup 
\  9(zB(9o,t) 


—  X^log/(ari|<9)ljC>-(a:t) 


11  , 
t~l 


>  a/4 


<  P  (  sup  -^|log/(x;|0)|  lr>c(xt)  >  a/4  -  A  J 

o,e)  n  !=1  / 


< 


E  |supfl6BW)t0  J  E?=  1  |log/(xt|0)|  1  Dc(Xi) 
|a/4  -  A| 


the  last  step  following  from  Markov’s  inequality.  Now 

sup  -53  |log/(x,|0)|  luc(*,-)  <  SUP  |1°S  .f(xi\0)\  lj3c(®«)' 

6eB(B0,e)  n  i=1  n  i=1  eeB(e0,c) 
Hence

EsupegB(<,0|{)  |log/(x|fl)l  lgc(a;)  =  JxeDe  /(^l^o)supg6B(go>e)  |log/(s|fl)|  dx 
|a/4  —  A|  |or/4  —  A| 

<  — - 

“  a/4 -A 


~  4 


the  last  bound  achieved  by  selecting  A  <  or  equivalently  A2  =  c4^j— .  This  is 

the  choice  for  A  such  that  the  second  term  of  (4.7)  is  at  most  S/4. 

We now proceed to find a bound on the first term of (4.7). Here the functions are bounded since their domain $D$ is bounded so that Theorem 3.9 is applicable. The procedure will be to directly find an upper bound on the covering number of this class instead of calculating its VC-dimension for bounding the covering number as mentioned just above Theorem 3.8. Then we can use the analysis in Haussler [12] and obtain a bound similar to Theorem 3.9. The class $Q$ is defined with the parameter $\theta$ restricted to within the closed ball $B(\theta_0,\epsilon)$. It is easy to calculate an upper bound for the covering number of this ball with respect to the Euclidean norm, denoted by $|\cdot|$. From Haussler [12], an $\epsilon'$-cover for a set $T$ is a finite set $C$ (not necessarily in $T$) such that for all $x \in T$ there is a $y \in C$ with $|x - y| \le \epsilon'$. The cardinality of the smallest $\epsilon'$-cover for $T$ is called the covering number and is denoted by $\mathcal{N}(\epsilon', T, |\cdot|)$. A set $T$ is $\epsilon'$-separated if for all distinct $x, y \in T$, $|x - y| > \epsilon'$. The size of the largest $\epsilon'$-separated subset of $T$ is called the packing number and is denoted by $\mathcal{M}(\epsilon', T, |\cdot|)$. It is easy to see that $\mathcal{N}(\epsilon', T, |\cdot|) \le \mathcal{M}(\epsilon', T, |\cdot|)$. Consider an $\epsilon'$-separated set with size $\mathcal{M}(\epsilon', T, |\cdot|)$. Put around each point a sphere of radius $\epsilon'$ and let this be a covering. Suppose it is not an $\epsilon'$-covering. Then there exists some point $x$ whose distance from any of the points is $> \epsilon'$. But this would increment the size of the $\epsilon'$-separated set by one, which contradicts the fact that it is the maximum $\epsilon'$-separated set. Therefore there exists an $\epsilon'$-covering of size $\mathcal{M}(\epsilon', T, |\cdot|)$ and clearly the smallest $\epsilon'$-cover has cardinality no larger than it. That proves the inequality. Hence it suffices to find the packing number of our ball $B(\theta_0,\epsilon)$. The volume of a $2N$-dimensional ball is $k_{2N}r^{2N}$ where $k_{2N}$ is a constant and $r$ is its radius. The number of $\epsilon'$-balls that can be contained without overlap in $B(\theta_0,\epsilon)$ is therefore at most $(k_{2N}\epsilon^{2N})/(k_{2N}\epsilon'^{2N}) = (\epsilon/\epsilon')^{2N}$. Clearly this is also the maximum size of a $2\epsilon'$-separated set of points contained in $B(\theta_0,\epsilon)$, hence is equal to $\mathcal{M}(2\epsilon', B(\theta_0,\epsilon), |\cdot|)$. It therefore follows that $\mathcal{N}(\epsilon', B(\theta_0,\epsilon), |\cdot|) \le (2\epsilon/\epsilon')^{2N}$. We now proceed to find a bound
on  the  covering  number  of  the  class  Q  with  respect  to  the  id-norm  (as  in  Definition 
3.7). 
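The growth of the ball covering bound with the dimension can be made concrete with a small numerical sketch. The function names below are hypothetical and only evaluate the bound $(2\epsilon/\epsilon')^{2N}$ stated above; they do not construct a cover.

```python
import math


def ball_cover_bound(eps: float, eps_prime: float, n_dim: int) -> float:
    """Evaluate the bound (2*eps/eps')**(2N) on the eps'-covering number
    of the ball B(theta_0, eps) in R^(2N)."""
    if eps <= 0 or eps_prime <= 0 or n_dim < 1:
        raise ValueError("eps and eps_prime must be positive and n_dim >= 1")
    return (2.0 * eps / eps_prime) ** (2 * n_dim)


def log_ball_cover_bound(eps: float, eps_prime: float, n_dim: int) -> float:
    """Natural log of the same bound; linear in N, which is what enters
    the sample complexity through the log of the covering number."""
    if eps <= 0 or eps_prime <= 0 or n_dim < 1:
        raise ValueError("eps and eps_prime must be positive and n_dim >= 1")
    return 2 * n_dim * math.log(2.0 * eps / eps_prime)
```

The logarithm of the bound grows only linearly in $N$, which is why the covering number contributes a factor of order $N \log(1/\epsilon')$ to the sample complexity.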

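Let the set $\{\theta_1, \theta_2, \ldots, \theta_{\mathcal{N}(\epsilon', B(\theta_0,\epsilon), |\cdot|)}\}$ be a covering of $B(\theta_0,\epsilon)$. We now construct a covering for $\mathcal{G}$. Fixing a particular $\theta_i$, we show that any function $g(x|\theta)$ with $|\theta - \theta_i| \le \epsilon'$ is $\delta'$-close to $g(x|\theta_i)$ in the sup-norm. Using the notation $\theta_{\epsilon'}$ for such a $\theta$, we therefore have $|\theta_{\epsilon'1} - \theta_{i1}| \le \epsilon'$ and $|\theta_{\epsilon'2} - \theta_{i2}| \le \epsilon'$, and so $\theta_{i1} = \theta_{\epsilon'1} + v_1$ and $\theta_{i2} = \theta_{\epsilon'2} + v_2$ where both $v_1$ and $v_2$ are of magnitude $\le \epsilon'$. Then

\begin{align*}
\sup_{\theta_{\epsilon'}} \sup_{x \in \mathbb{R}^N} |g(x|\theta_{\epsilon'}) - g(x|\theta_i)|
&= \sup_{\theta_{\epsilon'}} \sup_{x \in \mathbb{R}^N} \left| \log f(x|\theta_{\epsilon'}) 1_D(x) - \log f(x|\theta_i) 1_D(x) \right| \\
&= \sup_{\theta_{\epsilon'}} \sup_{x \in \mathbb{R}^N} \left| \log\left(e^{-\frac12|x-\theta_{\epsilon'1}|^2} + e^{-\frac12|x-\theta_{\epsilon'2}|^2}\right) 1_D(x) - \log\left(e^{-\frac12|x-\theta_{\epsilon'1}-v_1|^2} + e^{-\frac12|x-\theta_{\epsilon'2}-v_2|^2}\right) 1_D(x) \right| \\
&= \sup_{\theta_{\epsilon'}} \sup_{x \in D} \left| \log\left(e^{-\frac12|x-\theta_{\epsilon'1}|^2} + e^{-\frac12|x-\theta_{\epsilon'2}|^2}\right) - \log\left(e^{-\frac12|x-\theta_{\epsilon'1}-v_1|^2} + e^{-\frac12|x-\theta_{\epsilon'2}-v_2|^2}\right) \right| \\
&= \sup_{y \in E_i} \left| \log\left(e^{-\frac12|y_1|^2} + e^{-\frac12|y_2|^2}\right) - \log\left(e^{-\frac12|y_1-v_1|^2} + e^{-\frac12|y_2-v_2|^2}\right) \right|
\end{align*}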


where $E_i = \{y : y = [x, x] - \theta_{\epsilon'},\ x \in D,\ \theta_{\epsilon'} \in B(\theta_i, \epsilon')\}$ is a compact subset of $\mathbb{R}^{2N}$, as is now shown (the subscript $i$ shows the dependence on $\theta_i$). A metric space $X$, here a subset of $\mathbb{R}^{2N}$, is compact if every infinite sequence has a subsequence converging to a point in $X$ (cf. Royden [32]). It suffices to consider a sequence $y_n \in E_i$ and prove that it has a convergent subsequence. Since $y_n = [x_n, x_n] - \theta_{\epsilon' n}$ is in $E_i$, then $x_n \in D$. But $D$ is compact, hence there exists a subsequence $x_{m(n)} \to x \in D$. Corresponding to it we have the subsequence $y_{m(n)} = [x_{m(n)}, x_{m(n)}] - \theta_{\epsilon' m(n)}$. Now $\theta_{\epsilon' m(n)} \in B(\theta_i, \epsilon')$, hence there exists a subsequence $\theta_{\epsilon' k(m(n))} \to \theta \in B(\theta_i, \epsilon')$. Corresponding to this we have $y_{k(m(n))} = [x_{k(m(n))}, x_{k(m(n))}] - \theta_{\epsilon' k(m(n))}$. But clearly $x_{k(m(n))} \to x \in D$. Hence the subsequence $y_{k(m(n))} \to y \in E_i$, which proves the claim that $E_i$ is compact. Continuing, we have

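\[ \sup_{\theta_{\epsilon'}} \sup_{x \in \mathbb{R}^N} |g(x|\theta_{\epsilon'}) - g(x|\theta_i)| = \sup_{y \in E_i} |h(y) - h(y - v)| \]

where $h(y) = \log\left(e^{-\frac12|y_1|^2} + e^{-\frac12|y_2|^2}\right)$ is continuous and $|v| \le \sqrt{2}\,\epsilon'$; note that for any $y \in E_i$, $y - v \in E_i$ since $\theta_{\epsilon'} + v = \theta_i \in B(\theta_i, \epsilon')$. Clearly $h$ is uniformly continuous over $E_i$ so that, for any fixed $\epsilon' > 0$, we can find $M_i$ such that

\[ \sup_{y \in E_i} |h(y) - h(y - v)| \le M_i \epsilon'. \]

It follows that

\[ \sup_x |g(x|\theta) - g(x|\theta_i)| \le M_i \epsilon' \le M'_{\theta_0} \epsilon', \qquad \theta \in B(\theta_i, \epsilon') \]

where $M'_{\theta_0} = \max_{1 \le i \le \mathcal{N}(\epsilon', B(\theta_0,\epsilon), |\cdot|)} M_i$. Finally, take any function $g(x|\theta)$ where $\theta \in B(\theta_0, \epsilon)$. Then in the covering of $B(\theta_0, \epsilon)$ there exists $\theta_i$ such that $|\theta - \theta_i| \le \epsilon'$. Corresponding to this $\theta_i$ there exists a function $g(x|\theta_i)$ such that $\sup_x |g(x|\theta) - g(x|\theta_i)| \le M'_{\theta_0}\epsilon'$. This implies that for any function $g(x|\theta) \in \mathcal{G}$, there exists a $g(x|\theta_i)$ in the collection

\[ \{g(x|\theta_1), g(x|\theta_2), \ldots, g(x|\theta_{\mathcal{N}(\epsilon', B(\theta_0,\epsilon), |\cdot|)})\} \]

such that

\[ \sup_x |g(x|\theta) - g(x|\theta_i)| \le M'_{\theta_0}\epsilon'. \]

Hence

\[ E|g(x|\theta) - g(x|\theta_i)| \le M'_{\theta_0}\epsilon' = \delta' \]

where expectation is w.r.t. any probability measure on $\mathbb{R}^N$. So a subset of $\mathcal{G}$ with $\left(\frac{2\epsilon M'_{\theta_0}}{\delta'}\right)^{2N}$ functions covers $\mathcal{G}$ to an accuracy of $\delta'$ w.r.t. the $L_Q$-norm, hence $\mathcal{N}(\delta', \mathcal{G}, L_Q) \le \left(\frac{2\epsilon M'_{\theta_0}}{\delta'}\right)^{2N}$ for any measure $Q$ (see Definition 3.7). (Henceforth we rename $M'_{\theta_0}$ as $c_5$.) Lastly, in order to use Theorem 3.9 we need to determine a bound on the range of the functions of $\mathcal{G}$. We have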


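\begin{align*}
\sup_\theta \sup_{x \in D} |\log f|
&= \sup_\theta \sup_{|x-\theta_{01}| \le A \text{ or } |x-\theta_{02}| \le A} \left|\log\left(\tfrac12 f_1 + \tfrac12 f_2\right)\right| \\
&\le \sup_\theta \sup_{|x-\theta_{01}| \le A} \left|\log\left(\tfrac12 f_1 + \tfrac12 f_2\right)\right| + \sup_\theta \sup_{|x-\theta_{02}| \le A} \left|\log\left(\tfrac12 f_1 + \tfrac12 f_2\right)\right| \\
&\le \sup_\theta \sup_{|x-\theta_{01}| \le A} \left|\log \tfrac12 f_1\right| + \sup_\theta \sup_{|x-\theta_{02}| \le A} \left|\log \tfrac12 f_2\right| \\
&= 2 \sup_\theta \sup_{|x-\theta_{01}| \le A} \left|\log \tfrac12 f_1\right|.
\end{align*}

Define $\beta$ as

\[ \log\frac{1}{\beta} = \sup_\theta \sup_{|x-\theta_{01}| \le A} \left|\log \tfrac12 f_1\right|. \]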


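To find $\beta$ we evaluate the maximally shifted Gaussian at the boundary of the region $\{x : |x - \theta_{01}| \le A\}$, i.e., at $[A, 0, \ldots, 0]$ (since we take $\theta_{01} = 0$ as before). With

\[ \beta = \frac{e^{-\frac12(A+\epsilon)^2}}{2(2\pi)^{N/2}} \]

we have

\[ \log\frac{1}{\beta} = \left|\log\frac{e^{-\frac12(A+\epsilon)^2}}{2(2\pi)^{N/2}}\right| = \frac{(A+\epsilon)^2}{2} + \frac{N}{2}\log 2\pi + \log 2. \]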
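As a quick sanity check of the range bound, the sketch below compares the closed form for $\log(1/\beta)$ with a direct evaluation of the density value $\beta$. The function names are hypothetical; `a_radius` stands for $A$ and `n_dim` for $N$.

```python
import math


def beta_direct(a_radius: float, eps: float, n_dim: int) -> float:
    """Value of (1/2) * N(0, I) density at distance A + eps from its mean."""
    return math.exp(-0.5 * (a_radius + eps) ** 2) / (2.0 * (2.0 * math.pi) ** (n_dim / 2.0))


def log_inv_beta(a_radius: float, eps: float, n_dim: int) -> float:
    """Closed form (A+eps)^2/2 + (N/2) log(2*pi) + log 2."""
    return 0.5 * (a_radius + eps) ** 2 + 0.5 * n_dim * math.log(2.0 * math.pi) + math.log(2.0)
```

The closed form makes the dependence explicit: the range bound grows linearly in $N$ and quadratically in $A$.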


Now  recall  that  A2  =  c4^=  so  log  L  <  c6^=  for  some  positive  constant  c6.  We  let 
this be $M$ in Theorem 3.9 with $\alpha = c_1\epsilon^2 + c_0\epsilon^3$ and use $(2\epsilon c_5/\delta')^{2N}$ to bound the covering number there. This implies the probability that the first part of (4.7) is not true is at most $\delta/4$ when the number of unlabeled examples is

^  64<^JV2  (  c7e  ~c8\ 

n  5  -^rrIogT+losTJ 

-  i5r(wlog£f+logf)  (4,9) 

or  simply  when 

"2^(iv'og7  +  logD- 

With this many unlabeled examples, the probability of the first term of (4.7) is at most $\delta/4$. We have already bounded the second term of (4.7) by $\delta/4$; therefore this unlabeled sample complexity guarantees that (4.3) holds, and hence that there exists a maximum of $L(\theta)$ inside the ball $B(\theta_0, \epsilon)$ with probability $\ge 1 - \delta/2$.


4.3.2  Global  Maximum  of  the  Likelihood  Function 

Now we analyze the conditions needed to ensure that the global maximum of $L(\theta)$ is $\epsilon$-close to $\theta_0$. We have established above that there exists a maximum of $L(\theta)$ inside the closed ball $B(\theta_0,\epsilon)$, i.e., there exists some $\theta_a \in B(\theta_0, \epsilon)$ such that $L(\theta_a) \ge L(\theta_0)$. It remains to guarantee that for all $\theta \in \Theta\setminus B(\theta_0, \epsilon)$, $L(\theta) < L(\theta_0)$, where $\Theta$ denotes a compact region in $\mathbb{R}^{2N}$ which contains $\theta_0$, and is the region where the learner searches for the argsup of $L(\theta)$.

There is a small notational nuisance here: in the preceding, we used the compactness of $B(\theta_0, \epsilon)$ when proving that the class $\mathcal{G}$ is finitely coverable. Now we use the notation $B(\theta_0,\epsilon)$ to mean an open ball, so that $\Theta\setminus B(\theta_0,\epsilon)$ is compact. Following the same steps as before, to guarantee to within some confidence that $L(\theta) < L(\theta_0)$ it is sufficient to have


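\[ \sup_{\theta \in \Theta\setminus B(\theta_0,\epsilon)} \left| \frac{1}{n}\sum_{i=1}^{n} \log\frac{f(x_i|\theta)}{f(x_i|\theta_0)} - E\log\frac{f(x|\theta)}{f(x|\theta_0)} \right| < \alpha(\epsilon) \qquad (4.10) \]

where

\[ \alpha(\epsilon) = \inf_{\theta \in \Theta\setminus B(\theta_0,\epsilon)} \left| E\log\frac{f(x|\theta)}{f(x|\theta_0)} \right|. \]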

We first redefine the class $\mathcal{G} = \{\log f(x|\theta) 1_D(x) : \theta \in \Theta\setminus B(\theta_0,\epsilon)\}$. Then we estimate the value for $\alpha$, and in particular, analyze its dependence on $\epsilon$ and the dimension $N$. This is in analogy to our previous findings for $\alpha$ used for the sample complexity calculation for guaranteeing the consistency of a relative maximum of $L(\theta)$; we had earlier found the estimate $\alpha(\epsilon) = O(\epsilon^2)$ independent of $N$. Then it remains to find the covering number of the region $\Theta\setminus B(\theta_0, \epsilon)$ and use it to calculate the sample complexity needed such that (4.10) holds. As before, the covering number is still
bounded  by  an  exponential  in  N  and  the  analysis  leading  to  (4.9)  still  holds;  we 
construct  a  fa(x)  and  a  =  C,  C  is  constant  w.r.t.  N,  e.  We  only  need  to  replace 
log/3  by  log/3©  in  the  argument  prior  to  (4.9),  where  /3©  is  the  smallest  value  that 
any  mixture  f(x\9)  can  take  over  the  domain  D  when  0  6  0;  this  is  smaller  than  /3 
but  still,  log  =  CN/y/a8  for  some  constant  C  >  0. 

We proceed to estimate $\alpha(\epsilon)$ anew for the new domain. Denote $\Phi(\theta) = E\log\frac{f(x|\theta)}{f(x|\theta_0)}$. For any given $\epsilon > 0$, using our new notation for the open ball, the region $\Theta\setminus B(\theta_0, \epsilon)$ is compact. The function $\Phi$ is continuous over the region, hence $\Phi$ achieves its maximum value, which must differ from 0 (shown earlier) and hence is strictly $< 0$, whence it follows that $\alpha(\epsilon) > 0$. To verify that $\Phi(\theta)$ is continuous over this region, write

\[ \Phi(\theta) = \int f(x|\theta_0) \log f(x|\theta)\, dx - \int f(x|\theta_0) \log f(x|\theta_0)\, dx. \]

Let $\{\theta_n, n \ge 1\}$ be a sequence convergent to a point $\theta_a \in \Theta\setminus B(\theta_0,\epsilon)$ and such that for a constant $\rho > 0$, $|\theta_n - \theta_a| \le \rho$ for each $n$. It suffices to show that the first term, denoted by $g(\theta)$, is continuous, i.e., that

\[ \lim_{n\to\infty} \int f(x|\theta_0) \log f(x|\theta_n)\, dx = \int f(x|\theta_0) \lim_{n\to\infty} \log f(x|\theta_n)\, dx = g(\theta_a). \qquad (4.11) \]
To justify the exchange of limit and integral, note that $\log f(x|\theta_n)$ is continuous and

\begin{align*}
|f(x|\theta_0) \log f(x|\theta_n)|
&\le f(x|\theta_0)\, |\log f(x|\theta_n)| \le f(x|\theta_0) \left|\log \tfrac12 f_1(x|\theta_{n1})\right| \le C_1 f(x|\theta_0) + \tfrac12 f(x|\theta_0) |x - \theta_{n1}|^2 \\
&\le C_1 f(x|\theta_0) + \tfrac12 f(x|\theta_0)|x|^2 + f(x|\theta_0)|x| C_2 + \tfrac12 f(x|\theta_0) C_2^2
\end{align*}

where $C_1$, $C_2$ are constants and $|\theta_n| \le C_2$ for all $\theta_n \in B(\theta_a, \rho)$. The integral of the right side is bounded by some finite constant w.r.t. $n$ (we partition the integral and then bound the noncompact part similar to page 64), hence the Lebesgue dominated convergence theorem permits the exchange of limit and integral in (4.11).

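Now we show that $\alpha(\epsilon)$ does not depend on $N$. (In the following we will ignore the constant $\frac12$ for clarity.) We first split $\Phi(\theta)$ as follows:

\begin{align*}
\Phi(\theta)
&= \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x-\theta_{01}|^2} \log\left(e^{-\frac12|x-\theta_1|^2} + e^{-\frac12|x-\theta_2|^2}\right) dx \\
&\quad + \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x-\theta_{02}|^2} \log\left(e^{-\frac12|x-\theta_1|^2} + e^{-\frac12|x-\theta_2|^2}\right) dx \\
&\quad - \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x-\theta_{01}|^2} \log\left(e^{-\frac12|x-\theta_{01}|^2} + e^{-\frac12|x-\theta_{02}|^2}\right) dx \\
&\quad - \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x-\theta_{02}|^2} \log\left(e^{-\frac12|x-\theta_{01}|^2} + e^{-\frac12|x-\theta_{02}|^2}\right) dx \\
&= \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x-\theta_{01}|^2} \log e^{-\frac12|x-\theta_1|^2}\, dx \\
&\quad + \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x-\theta_{01}|^2} \log\left(1 + e^{\frac12|x-\theta_1|^2 - \frac12|x-\theta_2|^2}\right) dx \\
&\quad + \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x-\theta_{02}|^2} \log e^{-\frac12|x-\theta_2|^2}\, dx \\
&\quad + \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x-\theta_{02}|^2} \log\left(1 + e^{\frac12|x-\theta_2|^2 - \frac12|x-\theta_1|^2}\right) dx \\
&\quad - \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x-\theta_{01}|^2} \log e^{-\frac12|x-\theta_{01}|^2}\, dx \\
&\quad - \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x-\theta_{01}|^2} \log\left(1 + e^{\frac12|x-\theta_{01}|^2 - \frac12|x-\theta_{02}|^2}\right) dx \\
&\quad - \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x-\theta_{02}|^2} \log e^{-\frac12|x-\theta_{02}|^2}\, dx \\
&\quad - \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x-\theta_{02}|^2} \log\left(1 + e^{\frac12|x-\theta_{02}|^2 - \frac12|x-\theta_{01}|^2}\right) dx.
\end{align*}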


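Four of the above terms sum to:

\[ -\tfrac12|\theta_{01} - \theta_1|^2 - \tfrac{N}{2} - \tfrac12|\theta_{02} - \theta_2|^2 - \tfrac{N}{2} + \tfrac{N}{2} + \tfrac{N}{2} = -\tfrac12|\theta_{01} - \theta_1|^2 - \tfrac12|\theta_{02} - \theta_2|^2. \]

The other four terms (those with the $\log(1 + e^{(\cdot)})$) we denote by $I_1 + I_2 - I_3 - I_4$. We manipulate $I_1$ as follows:

\begin{align*}
I_1 &= \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x-\theta_{01}|^2} \log\left(1 + e^{\frac12|x-\theta_1|^2 - \frac12|x-\theta_2|^2}\right) dx \\
&= \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|y+\theta_1-\theta_{01}|^2} \log\left(1 + e^{\frac12|y|^2 - \frac12|y-(\theta_2-\theta_1)|^2}\right) dy
\end{align*}

where we simply changed to the $y$-coordinate frame whose origin is at the point $\theta_1$. Now rotate the coordinate frame to a new primed frame as $y' = Qy$ where $Q$ is a unitary matrix chosen so that the $y'_1$-axis goes through the point $\theta_2$ and the $(y'_1, y'_2)$-plane goes through $\theta_{01}$. So we have $y = Q^T y'$ and the inverse Jacobian is just 1 since the determinant of $Q$ is 1. Thus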

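\begin{align*}
I_1 &= \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|Q^T y' + \theta_1 - \theta_{01}|^2} \log\left(1 + e^{\frac12|Q^T y'|^2 - \frac12|Q^T y' - (\theta_2 - \theta_1)|^2}\right) dy' \\
&= \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|Q^T y' + Q^T(\theta_1 - \theta_{01})'|^2} \log\left(1 + e^{\frac12|Q^T y'|^2 - \frac12|Q^T y' - Q^T(\theta_2 - \theta_1)'|^2}\right) dy' \\
&= \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|Q^T(y' + (\theta_1 - \theta_{01})')|^2} \log\left(1 + e^{\frac12|Q^T y'|^2 - \frac12|Q^T(y' - (\theta_2 - \theta_1)')|^2}\right) dy'
\end{align*}

where $(\theta_1 - \theta_{01})'$ is a coordinate vector w.r.t. the primed frame. Observe that

\[ |Q^T y'|^2 = y'^T Q Q^T y' = \langle y', y' \rangle = |y'|^2 \]

and

\[ |Q^T(y' + (\theta_1 - \theta_{01})')|^2 = (y' + (\theta_1 - \theta_{01})')^T Q Q^T (y' + (\theta_1 - \theta_{01})') = \langle y' + (\theta_1 - \theta_{01})',\ y' + (\theta_1 - \theta_{01})' \rangle = |y' + (\theta_1 - \theta_{01})'|^2. \]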


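The integral can hence be written in the form

\[ I_1 = \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|y'|^2 - \frac12|\theta'_{01} - \theta'_1|^2 + |\theta'_{01} - \theta'_1| \left\langle y', \frac{\theta'_{01} - \theta'_1}{|\theta'_{01} - \theta'_1|} \right\rangle} \log\left(1 + e^{-\frac12|\theta'_2 - \theta'_1|^2 + |\theta'_2 - \theta'_1| \left\langle y', \frac{\theta'_2 - \theta'_1}{|\theta'_2 - \theta'_1|} \right\rangle}\right) dy'. \]

The inner product $\left\langle y', \frac{\theta'_2 - \theta'_1}{|\theta'_2 - \theta'_1|} \right\rangle$ is the size of the projection of the $N$-dimensional vector $y'$ on the $y'_1$ axis, because we chose the primed frame so that the vector $\theta'_2 - \theta'_1$ is on the $y'_1$ axis. Hence this equals the first component of $y'$, i.e., $y'_1$. The inner product $\left\langle y', \frac{\theta'_{01} - \theta'_1}{|\theta'_{01} - \theta'_1|} \right\rangle$ is the size of the projection of $y'$ on the vector $\theta'_{01} - \theta'_1$, which is on the $(y'_1, y'_2)$ plane; hence it must be a function only of the first two elements of the vectors $y'$ and $\frac{\theta'_{01} - \theta'_1}{|\theta'_{01} - \theta'_1|}$, i.e., $y'_1$, $y'_2$, $\left\langle \frac{\theta'_{01} - \theta'_1}{|\theta'_{01} - \theta'_1|} \right\rangle_1$ and $\left\langle \frac{\theta'_{01} - \theta'_1}{|\theta'_{01} - \theta'_1|} \right\rangle_2$; denote it by $g(y'_1, y'_2)$. (Here the notation $\langle \cdot \rangle_i$ means the $i$th element of the vector.) For clarity we rename $y'$, which is just a variable of integration, to $x$. Now we transform to a cylindrical coordinate system:

\begin{align*}
x_1 &= x_1 \\
x_2 &= x_2 \\
x_3 &= r \cos\phi_1 \cos\phi_2 \cdots \cos\phi_{N-3} \\
x_4 &= r \sin\phi_1 \cos\phi_2 \cdots \cos\phi_{N-3} \\
&\ \ \vdots \\
x_N &= r \sin\phi_{N-3}.
\end{align*}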

For example, for $N = 2, 3$ the Jacobian is 1. For $N = 4$ the inverse of the Jacobian is $r$. For $N = 5$ the inverse Jacobian is:


1  0  0  0  0 

0  1  0  0  0 

0  0  —  r  sin  fa  cos  fa  —r  cos  fa  sin  fa  cos  fa  cos  fa 

0  0  r  cos  fa  cos  fa  — r  sin  fa  sin  fa  sin  fa  cos  fa 

0  0  0  r  cos  ^2  sin  fa 

1  0  0  0  0 

0  10  0  0 

=  . 1 ,  0  0  —  r  sin  cos  fa  §  0 

0  0  0  —  rsmfa  cos  fa 

0  0  0  r  cos  fa  sin  fa 

=  — (r  cos  fa)(r). 


For $N = 6$, the inverse Jacobian equals $-(r\cos\phi_2\cos\phi_3)(r\cos\phi_3)(r)$. In general, for $N \ge 4$ the Jacobian evaluates to $r^{N-3}\cos^{N-4}\phi_{N-3} \cdots \cos\phi_2$. The variables range over values $0 \le r < \infty$, $0 \le \phi_1 \le 2\pi$, $-\pi/2 \le \phi_k \le \pi/2$, $2 \le k \le N - 3$. It is easy to see that the transformation is globally invertible. Also we have $|x|^2 = r^2 + x_1^2 + x_2^2$. So the integral becomes:

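\begin{align*}
I_1 &= \int_{-\pi/2}^{\pi/2} \cos^{N-4}\phi_{N-3} \cdots \int_{-\pi/2}^{\pi/2} \cos\phi_2 \int_0^{2\pi} \int_0^\infty \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{r^{N-3}}{(2\pi)^{N/2}} e^{-\frac12 r^2 - \frac12(x_1^2 + x_2^2) - \frac12|\theta'_{01} - \theta'_1|^2 + |\theta'_{01} - \theta'_1| g(x_1, x_2)} \\
&\qquad \times \log\left(1 + e^{-\frac12|\theta'_2 - \theta'_1|^2 + |\theta'_2 - \theta'_1| x_1}\right) dx_1\, dx_2\, dr\, d\phi_1\, d\phi_2 \cdots d\phi_{N-3}.
\end{align*}

The integrand does not depend on $\phi_1, \ldots, \phi_{N-3}$, hence we can write $I_1 = G(N) \cdot I_1^*$ where

\[ G(N) = \int_{-\pi/2}^{\pi/2} \cos^{N-4}\phi_{N-3}\, d\phi_{N-3} \cdots \int_{-\pi/2}^{\pi/2} \cos\phi_2\, d\phi_2 \int_0^{2\pi} d\phi_1 \]

\begin{align*}
I_1^* &= \int_0^\infty \frac{r^{N-3}}{(2\pi)^{N/2}} e^{-\frac12 r^2}\, dr \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-\frac12(x_1^2 + x_2^2) - \frac12|\theta'_{01} - \theta'_1|^2 + |\theta'_{01} - \theta'_1| g(x_1, x_2)} \\
&\qquad \times \log\left(1 + e^{-\frac12|\theta'_2 - \theta'_1|^2 + |\theta'_2 - \theta'_1| x_1}\right) dx_1\, dx_2.
\end{align*}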


To evaluate $G(N)$, start with the identity

\[ \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x|^2}\, dx = 1. \]

Passing to the cylindrical frame we obtain the identity

\[ G(N) \int_0^\infty \frac{r^{N-3}}{(2\pi)^{N/2}} e^{-\frac12 r^2}\, dr \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-\frac12(x_1^2 + x_2^2)}\, dx_1\, dx_2 = 1, \]

whence 

(N  oo).  (4,2) 


-1 


r(f)  2tt3/2 
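The normalization identity can be checked numerically. The sketch below is an independent evaluation under stated assumptions, not a transcription of (4.12): it solves the identity for $G(N)$ using the standard closed form $\int_0^\infty r^{N-3} e^{-r^2/2}\,dr = 2^{(N-4)/2}\,\Gamma((N-2)/2)$, and compares the result with the product of the angular integrals, using $\int_{-\pi/2}^{\pi/2} \cos^{k-1}\phi\, d\phi = \sqrt{\pi}\,\Gamma(k/2)/\Gamma((k+1)/2)$. Function names are hypothetical.

```python
import math


def radial_integral(n_dim: int) -> float:
    """Closed form of the integral of r^(N-3) * exp(-r^2/2) over [0, inf)."""
    return 2.0 ** ((n_dim - 4) / 2.0) * math.gamma((n_dim - 2) / 2.0)


def g_from_identity(n_dim: int) -> float:
    """Solve G(N) * (2*pi)^(-N/2) * radial_integral * 2*pi = 1 for G(N)."""
    if n_dim < 4:
        raise ValueError("the cylindrical frame is used here for N >= 4")
    return (2.0 * math.pi) ** (n_dim / 2.0 - 1.0) / radial_integral(n_dim)


def g_from_angles(n_dim: int) -> float:
    """Product of the angular integrals defining G(N)."""
    if n_dim < 4:
        raise ValueError("the cylindrical frame is used here for N >= 4")
    total = 2.0 * math.pi  # integral over phi_1 in [0, 2*pi]
    for k in range(2, n_dim - 2):  # phi_k carries the weight cos^(k-1)
        total *= math.sqrt(math.pi) * math.gamma(k / 2.0) / math.gamma((k + 1) / 2.0)
    return total
```

Both routes give $2\pi$ for $N = 4$ and $4\pi$ for $N = 5$, the surface measures of the unit circle and the unit sphere.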

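From (4.12) we obtain

\[ I_1 = G(N) I_1^* = \frac{\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-\frac12(x_1^2 + x_2^2) - \frac12|\theta'_{01} - \theta'_1|^2 + |\theta'_{01} - \theta'_1| g(x_1, x_2)} \log\left(1 + e^{-\frac12|\theta'_2 - \theta'_1|^2 + |\theta'_2 - \theta'_1| x_1}\right) dx_1\, dx_2}{\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-\frac12(x_1^2 + x_2^2)}\, dx_1\, dx_2}. \]

From this we see that $I_1$ depends on $\theta_1, \theta_2, \theta_{01}$ only through the transformed-coordinate vector quantities $|\theta'_{01} - \theta'_1|$, $|\theta'_2 - \theta'_1|$, $\left\langle \frac{\theta'_{01} - \theta'_1}{|\theta'_{01} - \theta'_1|} \right\rangle_1$ and $\left\langle \frac{\theta'_{01} - \theta'_1}{|\theta'_{01} - \theta'_1|} \right\rangle_2$. Similarly, $I_2$ depends only on $|\theta'_{02} - \theta'_2|$, $|\theta'_1 - \theta'_2|$, $\left\langle \frac{\theta'_{02} - \theta'_2}{|\theta'_{02} - \theta'_2|} \right\rangle_1$ and $\left\langle \frac{\theta'_{02} - \theta'_2}{|\theta'_{02} - \theta'_2|} \right\rangle_2$. For $I_3$ we have: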

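\begin{align*}
I_3 &= \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x - \theta_{01}|^2} \log\left(1 + e^{\frac12|x - \theta_{01}|^2 - \frac12|x - \theta_{02}|^2}\right) dx \\
&= \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x|^2} \log\left(1 + e^{\frac12|x|^2 - \frac12|x - (\theta'_{02} - \theta'_{01})|^2}\right) dx \\
&= \int \frac{1}{(2\pi)^{N/2}} e^{-\frac12|x|^2} \log\left(1 + e^{-\frac12|\theta'_{02} - \theta'_{01}|^2 + |\theta'_{02} - \theta'_{01}| \left\langle x, \frac{\theta'_{02} - \theta'_{01}}{|\theta'_{02} - \theta'_{01}|} \right\rangle}\right) dx
\end{align*}

where we chose the primed frame such that $\theta_{01}$ is the origin and the $x_1$ axis goes through $\theta_{02}$ (we did not bother writing $y'$ but kept $x$ since it is only a variable of integration). Hence $\left\langle x, \frac{\theta'_{02} - \theta'_{01}}{|\theta'_{02} - \theta'_{01}|} \right\rangle$ is the projection of $x$ on the $x_1$-axis, which therefore equals $x_1$.

Using the identical transformation again,

\[ I_3 = G(N) \int_0^\infty \frac{r^{N-3}}{(2\pi)^{N/2}} e^{-\frac12 r^2}\, dr \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-\frac12(x_1^2 + x_2^2)} \log\left(1 + e^{-\frac12|\theta'_{02} - \theta'_{01}|^2 + |\theta'_{02} - \theta'_{01}| x_1}\right) dx_1\, dx_2. \]

So $I_3$ depends on $|\theta'_{02} - \theta'_{01}|$. Similarly for $I_4$.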

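Now, given any four $N$-dimensional points $\theta_1, \theta_2, \theta_{01}, \theta_{02}$ we can apply the following procedure to form a vector $v$:

• Let $\theta_1$ be the origin.

• Choose a primed $N$-dimensional frame s.t. $\theta_2$ lies on the $y'_1$-axis and $\theta_{01}$ on the $(y'_1, y'_2)$-plane.

• Record the values

\[ v_1 = |\theta'_{01} - \theta'_1|, \quad v_2 = |\theta'_2 - \theta'_1|, \quad v_3 = \left\langle \frac{\theta'_{01} - \theta'_1}{|\theta'_{01} - \theta'_1|} \right\rangle_1, \quad v_4 = \left\langle \frac{\theta'_{01} - \theta'_1}{|\theta'_{01} - \theta'_1|} \right\rangle_2. \]

• Let $\theta_2$ be the origin.

• Then choose another primed $N$-dimensional frame s.t. $\theta_1$ lies on the $y'_1$-axis and $\theta_{02}$ on the $(y'_1, y'_2)$-plane.

• Record the values

\[ v_5 = |\theta'_{02} - \theta'_2|, \quad v_6 = \left\langle \frac{\theta'_{02} - \theta'_2}{|\theta'_{02} - \theta'_2|} \right\rangle_1, \quad v_7 = \left\langle \frac{\theta'_{02} - \theta'_2}{|\theta'_{02} - \theta'_2|} \right\rangle_2, \quad v_8 = |\theta'_{01} - \theta'_{02}|. \]

Recall that the values of the other terms besides $I_1, \ldots, I_4$ depended on $|\theta_{01} - \theta_{02}|$ (which equals $v_8$), hence it follows that $\Phi(\theta)$ is a function only of the vector $v$. But there exist four 3D-vectors, $\theta_1, \theta_2, \theta_{01}, \theta_{02}$, with the same vector $v$. This follows since our four points in $N$-dimensional space lie on some 3D-subspace. Let $N = 3$ and apply the above procedure to these points (which have 3D coordinate vectors w.r.t. some frame in this 3D-subspace). This must yield the same vector $v$ because we did not disturb the points in any way. The vector $v$ lives in some subset $V \subset \mathbb{R}^8$ and the function $\Phi$ maps vectors $v \in V$ to $\mathbb{R}^1$. In particular, take any 4 points $\theta_{01}, \theta_{02}, \theta_1, \theta_2$ in $N$ dimensions such that $|\theta - \theta_0| > \epsilon$, or equivalently $|\theta_{01} - \theta_1|^2 + |\theta_{02} - \theta_2|^2 > \epsilon^2$. Then apply the procedure to get the vector $v$. As before, there exist 4 points $\theta_{01}, \theta_{02}, \theta_1, \theta_2$ in 3D with the same $v$, therefore with the same $|\theta_{01} - \theta_1|$ and $|\theta_{02} - \theta_2|$, therefore the same $|\theta - \theta_0| > \epsilon$ as the $N$-dimensional points. So for any $N \ge 3$, fixing $|\theta_{01} - \theta_{02}|$ results in fixing $\sup_{\theta \in \Theta\setminus B(\theta_0,\epsilon)} \Phi(\theta)$. Recall that $\sup_{\theta \in \Theta\setminus B(\theta_0,\epsilon)} \Phi(\theta) = -\alpha(\epsilon)$. The above implies that, for a fixed $\epsilon > 0$ and $|\theta_{01} - \theta_{02}|$, $\alpha(\epsilon)$ is invariant for all $N \ge 3$. This validates the earlier claim, in the case of small $\epsilon > 0$, that $\alpha(\epsilon)$ is independent of $N$.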

At this stage we have shown that given any $\epsilon > 0$, there exists $\alpha(\epsilon) > 0$ which, when used in (4.10), yields a finite sample complexity that guarantees with some confidence that the global maximum of $L(\theta)$ is $\epsilon$-close to the true unknown parameter $\theta_0 \in \mathbb{R}^{2N}$. This $\alpha(\epsilon)$ is constant for all $N \ge 3$, hence its effect on the sample complexity cannot worsen (i.e., $\alpha(\epsilon)$ cannot decrease) with increasing $N$. But we are still left with the question of how fast $\alpha(\epsilon)$ can decrease with $\epsilon$; if, for instance, it is decreasing as quickly as $e^{-1/\epsilon}$ then the sample complexity $n$ will grow exponentially with the accuracy parameter $\epsilon$.

Consider a ball $B(\theta_0, d)$ with $d > 0$ such that (using our previous results) we have $\Phi(\theta_\epsilon) = -c_1\epsilon^2 + c_0\epsilon^3$ for all $\theta_\epsilon$ on the surface $\partial B(\theta_0, \epsilon)$, with $0 < \epsilon \le d$. As $\Phi$ is continuous it achieves its maximum value over the region $\Theta\setminus B(\theta_0, d)$ at a point $\theta_a$ (where $B(\theta_0, d)$ is an open ball). Now, let $\theta_b$ be the farthest point from $\theta_0$ such that $\theta_b$ is in the closed ball $B(\theta_0, d)$ with $\Phi(\theta_a) \le \Phi(\theta_b)$. ($\theta_b$ could be on $\partial B(\theta_0, d)$.) By simple arguments it follows that for all $\theta_\epsilon$ such that $|\theta_\epsilon - \theta_0| \le |\theta_b - \theta_0|$, $\alpha(\epsilon) \ge |\Phi(\theta_\epsilon)| = |{-c_1\epsilon^2 + c_0\epsilon^3}|$, and it is true for all $N \ge 3$. Hence for all sufficiently small $\epsilon > 0$ we can estimate $\alpha(\epsilon) = c_1\epsilon^2$ for all $N \ge 3$.

Hence we may proceed using $c_1\epsilon^2$ for $\alpha(\epsilon)$ in (4.10), and for all sufficiently small $\epsilon > 0$, the bound on the unlabeled sample complexity (4.9) not only guarantees that there exists a maximum of $L(\theta)$ $\epsilon$-close to $\theta_0$ but also that it is the global maximum, i.e., that $\hat\theta$ is $\epsilon$-close to $\theta_0$. Therefore we have

\[ P\left(\{|\hat\theta_1 - \theta_{01}| > \epsilon\} \cup \{|\hat\theta_2 - \theta_{02}| > \epsilon\}\right) \le P(|\hat\theta - \theta_0| > \epsilon) \le \frac{\delta}{2}. \]

We have established above that the estimates $\hat\theta_1$ and $\hat\theta_2$ are at most $\epsilon$-away from $\theta_{01}$ and $\theta_{02}$ respectively. Using the analysis of Section 4.1, the classification error (under optimal labeling) can be written as $P_{error} = P_{Bayes}(1 + c_{12}\epsilon^2)$. We can replace $\epsilon^2$ by $\epsilon$ here and in (4.9) to get that with

n  =  Cj7r(Nhii+losi) 

unlabeled examples, the $P_{error}$ of the optimal labeling of the decision regions is $P_{Bayes}(1 + c_{12}\epsilon)$. It only remains to use labeled examples to guarantee that we pick this optimal labeling.

4.3.3  Labeling  the  Partition 

We have two unlabeled regions separated by the hyperplane between $\hat\theta_1$ and $\hat\theta_2$, where both are $\epsilon$-close to their respective true parameters. The good labeling has the above classification error. We use the labeled examples to control the confidence of picking the good labeling by the majority rule. We have the two regions $R_1$, $R_2$ on each side of the hyperplane. Draw $m$ labeled examples. Assign to each region the label of the majority of the examples that fell in it. If no examples fell in $R_i$ then label it “1” with probability $\frac12$ and “2” with probability $\frac12$. We now calculate $m$ needed to guarantee we pick the good labeling. Denote $\eta_1 = P(2|x \in R_1)$, $\eta_2 = P(1|x \in R_2)$ and $p = \int_{R_1} f(x)\, dx$. The quantities $\eta_1$ and $\eta_2$ are the probabilities that a randomly drawn $x$ is misclassified given it is in $R_1$ or $R_2$, respectively. Also, $p = P(R_1)$ and $1 - p = P(R_2)$. Let $p_{min} = \min(p, 1 - p)$ and $\eta_{max} = \max(\eta_1, \eta_2)$. Let the event {a random $x$ is misclassified} $= E$. There are four possible labelings: $L_{good}$ has $R_1$ labeled “1” and $R_2$ labeled “2”; $L_{bad,1}$ has $R_1$ labeled “2” and $R_2$ labeled “1”; $L_{bad,2}$ has $R_1$ and $R_2$ labeled “1”; $L_{bad,3}$ has $R_1$ and $R_2$ labeled “2”. With the same analysis as in Section 2 with the replacement of

8  =  12e-mPm,"(1-2 

we  obtain  the  probability  of  not  choosing  Lgooci  is 

(  8  8  8  \  1 
12  ^  24  94 J  =  2^' 

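Using the analysis of the error of the decision rule in $\mathbb{R}^N$ of Section 4.1 we get that $p_{min} \ge \frac12 - c_{14}\epsilon$ and $\eta_{max} \le P_{Bayes} + c_{15}\epsilon$. Plugging this into the exponential bound, for suitably small $\epsilon$, we get that $m = c_{16}\log\frac{1}{\delta}$ is sufficient to guarantee that the probability of not picking $L_{good}$ is low, i.e.,

\[ P\left(P_{error} > P_{Bayes}(1 + c_{17}\epsilon)\ \middle|\ \{|\hat\theta_1 - \theta_{01}| \le \epsilon\} \cap \{|\hat\theta_2 - \theta_{02}| \le \epsilon\}\right) \le \frac{\delta}{2}. \]

Combining this with the fact that

\[ P\left(\{|\hat\theta_1 - \theta_{01}| > \epsilon\} \cup \{|\hat\theta_2 - \theta_{02}| > \epsilon\}\right) \le P(|\hat\theta - \theta_0| > \epsilon) \le \frac{\delta}{2} \]

completes the proof of the theorem. ∎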
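The majority-rule labeling of Section 4.3.3 can be sketched as follows. This is a minimal illustration, not the dissertation's code: the function name and the `(region, label)` input format are hypothetical, and resolving exact ties by a fair coin is an assumption (the text specifies the coin flip only for a region that receives no examples).

```python
import random
from typing import Dict, List, Tuple


def label_regions(examples: List[Tuple[int, int]], rng: random.Random) -> Dict[int, int]:
    """Assign a class label to each of the two regions by majority vote.

    `examples` holds (region, label) pairs, both taking values 1 or 2:
    the region a labeled example fell in and its class label.
    """
    labeling: Dict[int, int] = {}
    for region in (1, 2):
        votes = [label for reg, label in examples if reg == region]
        ones, twos = votes.count(1), votes.count(2)
        if ones > twos:
            labeling[region] = 1
        elif twos > ones:
            labeling[region] = 2
        else:
            # Empty region (as in the text) or an exact tie (assumption): fair coin.
            labeling[region] = rng.choice((1, 2))
    return labeling
```

For example, `label_regions([(1, 1), (1, 1), (1, 2), (2, 2)], random.Random(0))` picks the good labeling, with region 1 labeled “1” and region 2 labeled “2”.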
4.4  General a priori Class Probabilities


Here we extend the results of Section 4.1 to the case of general a priori class probabilities, $p$ and $1 - p$. The learner uses algorithm Ep with the randomly drawn labeled examples to construct the decision rule.

Algorithm Ep:


The setting: Two pattern classes with underlying Gaussian mixture density

\[ f(x|[\mu_0, p]) = p\, g(x|\mu_{01}) + (1 - p)\, g(x|\mu_{02}). \]

The teacher draws labeled examples randomly at least once according to $f(x)$ by choosing class “1” with probability $p$, class “2” with probability $1 - p$, and then drawing according to the selected class conditional density $g(x|\mu_{0i})$, $i = 1, 2$.

Given: $m_1$ examples labeled as “1” and $m_2$ examples labeled as “2”, where

$m_1 + m_2 = m > 0$.


Begin:


1) If $m_1 = 0$ then label all of $\mathbb{R}^N$ as “2”. If $m_2 = 0$ then label all of $\mathbb{R}^N$ as “1”. Go to End.

2) Otherwise, continue with the following steps.

3) Let the mean estimates be

\[ \hat\mu_i = \frac{1}{m_i} \sum_{k=1}^{m_i} x_k^i \qquad (i = 1, 2) \]

where $x_k^i$ denotes the $k$th example vector of class $i$.

4) Estimate $p$ by

\[ \hat p = \frac{1}{m} \sum_{i=1}^{m} 1_{\{y_i = \text{“1”}\}} \]

where $y_i$ is the label of the $i$th example.

5) Let the decision border be the hyperplane defined by

\[ h(x) = \langle x - \hat\mu_1, \hat\mu_2 - \hat\mu_1 \rangle - \tfrac12|\hat\mu_2 - \hat\mu_1|^2 - \log\frac{\hat p}{1 - \hat p} = 0. \]

6) The classifier decides “1” for $x$ if $h(x) < 0$ and decides “2” otherwise.

End.
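The steps of algorithm Ep can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the dissertation's implementation: the function names and the `(vector, label)` input format are hypothetical, and the log-odds term is taken with the sign that makes "decide 1 when $h(x) < 0$" agree with the likelihood ratio test for unit-covariance Gaussians.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def sample_mean(vectors: List[Vector]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def fit_plug_in_rule(examples: List[Tuple[Vector, int]]) -> Callable[[Vector], int]:
    """Return a classifier built from labeled examples (labels 1 or 2)."""
    class1 = [x for x, y in examples if y == 1]
    class2 = [x for x, y in examples if y == 2]
    if not class1:            # step 1: no class "1" examples seen
        return lambda x: 2
    if not class2:            # step 1: no class "2" examples seen
        return lambda x: 1
    mu1, mu2 = sample_mean(class1), sample_mean(class2)      # step 3
    p_hat = len(class1) / len(examples)                      # step 4
    diff = [b - a for a, b in zip(mu1, mu2)]
    half_sq = 0.5 * sum(d * d for d in diff)
    log_odds = math.log(p_hat / (1.0 - p_hat))

    def classify(x: Vector) -> int:                          # steps 5 and 6
        h = sum((xi - a) * d for xi, a, d in zip(x, mu1, diff)) - half_sq - log_odds
        return 1 if h < 0 else 2

    return classify
```

With equal class counts the log-odds term vanishes and the rule reduces to the perpendicular bisector of the two mean estimates, as in the case $p = \frac12$.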


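The difference here, compared to the case of $p = \frac12$, is that the Bayes decision border depends on $p$, i.e.,

\[ h(x) = \langle x - \mu_{01}, \mu_{02} - \mu_{01} \rangle - \tfrac12|\mu_{02} - \mu_{01}|^2 - \log\frac{p}{1 - p} = 0. \]

In this case

\[ P_{Bayes} = (1 - p)\,\Phi\left(\frac{1}{2\lambda}\log\frac{p}{1 - p} - \lambda\right) + p\,\Phi\left(\frac{1}{2\lambda}\log\frac{1 - p}{p} - \lambda\right) \]

where $\lambda = \frac{|\mu_{02} - \mu_{01}|}{2}$ and $\Phi$ denotes the standard normal probability distribution. (We will use $c_i$ to denote finite positive constants as before.) So we need to estimate $p$ by $\hat p$ in order to form an estimate of the Bayes decision border.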
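The Bayes error for two unit-covariance Gaussian classes with priors $p$ and $1 - p$ can be evaluated directly. The sketch below assumes $\lambda$ is half the distance between the two class means; the function names are hypothetical.

```python
import math


def std_normal_cdf(t: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def bayes_error(p: float, distance: float) -> float:
    """Bayes error for class "1" prior p and mean separation `distance`."""
    if not 0.0 < p < 1.0 or distance <= 0.0:
        raise ValueError("need 0 < p < 1 and distance > 0")
    lam = distance / 2.0
    shift = math.log((1.0 - p) / p) / (2.0 * lam)
    # Class "1" is misclassified with probability Phi(shift - lam),
    # class "2" with probability Phi(-shift - lam).
    return p * std_normal_cdf(shift - lam) + (1.0 - p) * std_normal_cdf(-shift - lam)
```

For $p = \frac12$ the expression reduces to $\Phi(-\lambda)$, and for any $p$ it cannot exceed $\min(p, 1 - p)$, the error of always guessing the more likely class.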

We now determine the labeled sample complexity $m$. W.l.o.g. we assume $p \le 1 - p$. Denote by $A$ the event that the decision rule is as line (1) in the algorithm, and let $A^c$ denote the complement. We have

\[ P_{error} = P(\text{error}|A)P(A) + P(\text{error}|A^c)P(A^c). \]

Now

\[ P(A) = P(\{m_1 = 0\} \text{ or } \{m_2 = 0\}) \le 2(1 - p)^m \]

and

\[ P(A^c) = P(\{m_1 > 0\} \text{ and } \{m_2 > 0\}) \le P(\{m_1 > 0\}) = 1 - P(m_1 = 0) = 1 - (1 - p)^m. \]

Now conditioned under event $A$, the probability of error is determined as follows: given an $x$ labeled as “1” (with probability $p$) the algorithm misclassifies it only if $m_1 = 0$, which has probability $(1 - p)^m$. Similarly, given an $x$ labeled as “2” (with probability $1 - p$) the algorithm misclassifies it only if $m_2 = 0$, which has probability $p^m$. Using the inequality $(1 - p)^m \le e^{-mp}$ we have

\begin{align*}
P(\text{error}|A) &= (1 - p)^m p + p^m (1 - p) = p\left((1 - p)^m + p^{m-1}(1 - p)\right) \\
&\le p(1 - p)\left((1 - p)^{m-1} + p^{m-1}\right) \\
&\le 2p(1 - p)e^{-(m-1)p} \le 2e\,p\,e^{-mp}.
\end{align*}

So

\[ P(\text{error}|A)P(A) \le 4e\,p\,e^{-2mp}. \]
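The chain of inequalities for $P(\text{error}|A)$ is easy to spot-check numerically. The following sketch uses hypothetical function names and simply evaluates the exact expression and the final bound $2e\,p\,e^{-mp}$ for $p \le \frac12$.

```python
import math


def error_given_a(p: float, m: int) -> float:
    """Exact expression (1-p)^m * p + p^m * (1-p) from the text."""
    return (1.0 - p) ** m * p + p ** m * (1.0 - p)


def error_given_a_bound(p: float, m: int) -> float:
    """Upper bound 2 * e * p * exp(-m * p), valid for p <= 1/2 and m >= 1."""
    return 2.0 * math.e * p * math.exp(-m * p)
```

The bound is loose for small $m$ but captures the exponential decay in $mp$ that drives the sample complexity.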


Define the event

\[ E = \{\, |\hat\mu_i - \mu_{0i}| \le \epsilon,\ i = 1, 2,\ \text{and}\ |\hat p - p| \le \epsilon_1 \,\}. \]

We bound the term $P(\text{error}|A^c)$. We have

\[ P(\text{error}|A^c) = P(\text{error}|E)P(E) + P(\text{error}|E^c)P(E^c) \le P(\text{error}|E) + P(E^c). \]

In the rest of this section we estimate the two terms on the right.

First we determine $P(E^c)$. This is bounded above by the probability that $|\hat p - p| > \epsilon_1$ added to the probability that at least one of the mean estimates deviates by more than $\epsilon$ from the corresponding true mean.

For $p$ we use the obvious estimate, i.e., $\hat p = \frac{1}{m}\sum_{i=1}^{m} 1_{\{y_i = \text{“1”}\}}$ where $y_i$ is the label of the $i$th example. For this we have

\[ P(|\hat p - p| > \epsilon_1) \le 2e^{-2m\epsilon_1^2} = \delta_1. \]
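Inverting the bound $2e^{-2m\epsilon_1^2} = \delta_1$ gives the number of labeled examples that suffices to estimate $p$. The helper below is a hypothetical illustration of that inversion.

```python
import math


def labeled_size_for_prior(eps1: float, delta1: float) -> int:
    """Smallest integer m with 2 * exp(-2 * m * eps1^2) <= delta1."""
    if eps1 <= 0.0 or not 0.0 < delta1 < 1.0:
        raise ValueError("need eps1 > 0 and 0 < delta1 < 1")
    return math.ceil(math.log(2.0 / delta1) / (2.0 * eps1 ** 2))
```

For instance, accuracy $\epsilon_1 = 0.1$ with confidence parameter $\delta_1 = 0.05$ needs 185 labeled examples; the requirement grows as $1/\epsilon_1^2$ but only logarithmically in $1/\delta_1$.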

To estimate $\mu_{01}$ and $\mu_{02}$ we use the same ideas as in Section 4.1. This means the requirement is to have $m_1, m_2 \ge \frac{2N}{\epsilon^2}\log\frac{4N}{\delta}$ examples from class “1” and class “2”, respectively. The teacher draws labeled examples by using the mixture, i.e., first choosing class “1” according to the a priori probability $p$ and then drawing according to the selected class distribution. If $\hat p$ is at most $\epsilon_1$-far from $p$ then we do not need $m$ to be larger than $\frac{m_1}{p - \epsilon_1}$ in order to get $m_1$ class-“1” examples. The former has probability $\ge 1 - \delta_1$. Similarly, it suffices for $m$ to be $\ge \frac{m_2}{1 - p - \epsilon_1}$ in order to get $m_2$ class-“2” examples with probability $\ge 1 - \delta_1$. Equivalently, with probability $\ge 1 - 2\delta_1$, $m$ needs to be at most $\frac{m_1}{p - \epsilon_1}$ to get $m_1$ examples of class “1” and at most $\frac{m_2}{1 - p - \epsilon_1}$ to get $m_2$ examples of class “2”. Clearly it suffices to take

\[ m \ge \max\left( \frac{2N}{(p - \epsilon_1)\epsilon^2}\log\frac{4N}{\delta},\ \frac{2N}{(1 - p - \epsilon_1)\epsilon^2}\log\frac{4N}{\delta} \right) \]

to get $m_1$ class-“1” examples and $m_2$ class-“2” examples with probability $\ge 1 - 2\delta_1$. Using our assumption of $p \le 1 - p$, the above requirement is to have at least

\[ m = \frac{2N}{(p - \epsilon_1)\epsilon^2}\log\frac{4N}{\delta} \]

which guarantees that with probability at least $1 - 2\delta_1$ we get the necessary $m_1$ and $m_2$ s.t. with probability at least $1 - \delta$, $|\mu_{01} - \hat\mu_1| \le \epsilon$ and $|\mu_{02} - \hat\mu_2| \le \epsilon$ (for the last part see Section 4.1).

Therefore

\[ P(E^c) \le (2\delta_1 + \delta) + \delta_1 \le 3\delta_1 + \delta. \]


Now we determine $P(\text{error}|E)$, i.e., the $P_{error}$ of the decision border based on the above $\epsilon$-close estimates.

As in Section 4.1, $\hat\mu_1$ and $\hat\mu_2$ are $\epsilon$-close to $\mu_{01}$ and $\mu_{02}$ respectively. The decision border is obtained by plugging the estimates into the functional form and solving for $x$ in

\[ \frac{e^{-\frac12|x - \hat\mu_1|^2}}{e^{-\frac12|x - \hat\mu_2|^2}} = \frac{1 - \hat p}{\hat p} \]

which yields a decision border

\[ h(x) = \langle x, \hat\mu_1 - \hat\mu_2 \rangle + \frac{|\hat\mu_2|^2 - |\hat\mu_1|^2}{2} - \log\frac{1 - \hat p}{\hat p} = 0. \]

As in Section 4.1, conditioned on the high probability event that the estimates $\hat\mu_i$, $\hat p$ are close to $\mu_{0i}$, $p$ respectively, we have that $h(x)$ is distributed as a one-dimensional Gaussian. For $g(x) = h(x)/|\hat\mu_1 - \hat\mu_2|$ we have

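\[ p_1(g) \sim N\left( \frac{(\hat\mu_1 - \hat\mu_2)^T u^- + \frac12(|\hat\mu_2|^2 - |\hat\mu_1|^2) - \log\frac{1 - \hat p}{\hat p}}{|\hat\mu_1 - \hat\mu_2|},\ 1 \right) \]

and

\[ p_2(g) \sim N\left( \frac{(\hat\mu_1 - \hat\mu_2)^T u^+ + \frac12(|\hat\mu_2|^2 - |\hat\mu_1|^2) - \log\frac{1 - \hat p}{\hat p}}{|\hat\mu_1 - \hat\mu_2|},\ 1 \right) \]

where $u^-$, $u^+$ are defined in Section 4.1. So we get a one-dimensional decision problem which has the same $P_{error}$ for the decision rule as the original $N$-dimensional problem. Considering the configuration of $\hat\mu_1$ and $\hat\mu_2$ that yields a good upper bound on $P_{error}$, we get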

Perror  <  (1  —  p)$  A  —  l°g — ■—  +  Ae  +  CqC2  +  C\ 


+  p$  ^-A+  2Xlog“P  A..„.2  £l 


-  Ae  +  c2e2  -  c3 —  . 

P 


Breaking  the  4>()  into  two  parts  and  additional  bounding  gives 
p 

1  error 


<  (1  —  p)$  f — A  —  ^  log 


+  p$  f-A+^logi-^ 


+  c4  e  H — - 


Pb  ayes  +  C4  +  —  J  . 


So 


P  (error  |AC)  <  Pb  ayes  +  c4  e  4-  )  +  3<5i  +  8 


p* 
and hence


Perror  <  (pBayes  +  <=4  (e  +  ^  +  S61  +  (1  -  (1  -  ?)”*)  +  iePe~2mp . 

So for arbitrary $0 < \alpha, \beta, \gamma$, in order to have an error

\[ P_{error} \le (P_{Bayes} + \alpha)\beta + \gamma \]

the sufficient sample complexity $m$ is


m  =  c5  max 


N  .  N  logT^ 

I  ’  i - i”» 

a2p4  a  log  Yip 


For  a  prespecified  PeTror,  m  is  polynomial  in  but  as  p  — >  0  we  can  let  a  —*  oo, 
(5,7  0,  such  that  m  — >  1  and  Perror  -*  0-  (Note,  we  used  the  fact  that  algorithm 

Ep  draws  at  least  one  labeled  example).  Now,  for  fixed  0  <  p  <  but  increasing  N 
or  decreasing  a,  m  grows  like 


N 


ce 


i  N 

log—. 


a2p 4  a 


This  is  further  discussed  in  Chapter  6. 


4.5  Mixed  Sample 


In this section we use both labeled and unlabeled examples for learning the decision rule. As in Section 4.4, the Bayes decision border depends on the two means $\mu_{01}$, $\mu_{02}$ and on the a priori class “1” probability $p$. So we need to estimate these by $\hat\mu_{01}$, $\hat\mu_{02}$ and $\hat p$ and use

\[ h(x) = (x, \hat\mu_{01} - \hat\mu_{02}) + \frac{|\hat\mu_{02}|^2 - |\hat\mu_{01}|^2}{2} - \log\frac{1-\hat p}{\hat p} = 0 \]

as the decision border estimate.

We consider two approaches to utilizing the mixed sample. The first is based on algorithm M1, which uses the labeled examples to estimate $p$ by $\hat p$ to an accuracy $\epsilon_1$ and confidence $> 1 - \delta_1$, and the unlabeled examples to estimate $\mu_0 = [\mu_{01}, \mu_{02}]$ to an accuracy $\epsilon$ and confidence $> 1 - \delta$ using the MLE procedure.

The second approach estimates the vector $\theta_0 = [\mu_{01}, \mu_{02}, p]$ using the MLE procedure with unlabeled examples and uses labeled examples only for labeling the decision regions. We assume w.l.o.g. that $p < 1 - p$. We start with the first method.


4.5.1  Learning using algorithm M1


Algorithm M1:

The setting: Two pattern classes with an underlying Gaussian mixture density

\[ f(x|\mu_0, p) = p\,g(x|\mu_{01}) + (1-p)\,g(x|\mu_{02}) \]

with $\mu_0 \in M$ where $M$ is a compact subset of $\mathbb{R}^{2N}$. The teacher draws labeled and unlabeled examples according to $f$ by choosing class “1” with probability $p$, class “2” with probability $1 - p$, and then drawing according to the selected class conditional density $g(x|\mu_{0i})$, $i = 1, 2$.

Given: First, the teacher draws $m = m_1 + m_2 > 0$ labeled examples.

Begin:

1) If $m_1 = 0$ then label all of $\mathbb{R}^N$ as “2”. If $m_2 = 0$ then label all of $\mathbb{R}^N$ as “1”. Go to End.

2) Otherwise request $n > 0$ unlabeled examples.

3) Estimate $p$ by

\[ \hat p = \frac{1}{m}\sum_{i=1}^{m} 1_{\{y_i = \text{“1”}\}} \]

where $y_i$ is the label of the ith example.

4) Using the unlabeled sample, estimate the mean vector $\mu_0 = [\mu_{01}, \mu_{02}]$ by the point $\hat\mu$,

\[ \hat\mu = \operatorname{argsup}_{\mu \in M} \frac{1}{n}\sum_{i=1}^{n} \log f(x_i|\mu, \hat p). \]

5) Let the decision border be the hyperplane defined by

\[ h(x) = (x - \hat\mu_1, \hat\mu_2 - \hat\mu_1) - \frac{1}{2}|\hat\mu_1 - \hat\mu_2|^2 - \log\frac{\hat p}{1-\hat p} = 0. \]

6) The classifier decides “1” for $x$ if $h(x) < 0$ and decides “2” otherwise.

End.


As in Section 4.4, we estimate $p$ using the labeled sample to have that

\[ P(|\hat p - p| > \epsilon_1) \le 2e^{-2m\epsilon_1^2} = \delta_1. \]
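The bound above can be inverted to read off how many labeled examples suffice for a target accuracy and confidence in the estimate of $p$. The following is a minimal sketch, assuming only the inequality $2e^{-2m\epsilon_1^2} \le \delta_1$; the function name `labeled_sample_size` is hypothetical and is not part of the original analysis.

```python
import math


def labeled_sample_size(eps1: float, delta1: float) -> int:
    """Smallest integer m with 2*exp(-2*m*eps1**2) <= delta1.

    Solving the inequality for m gives m >= log(2/delta1) / (2*eps1**2).
    Both arguments are assumed to lie strictly between 0 and 1.
    """
    if not (0.0 < eps1 < 1.0 and 0.0 < delta1 < 1.0):
        raise ValueError("eps1 and delta1 must lie in (0, 1)")
    return math.ceil(math.log(2.0 / delta1) / (2.0 * eps1 ** 2))


def hoeffding_bound(m: int, eps1: float) -> float:
    """Right-hand side of the bound, 2*exp(-2*m*eps1**2)."""
    return 2.0 * math.exp(-2.0 * m * eps1 ** 2)
```

The required m grows like $1/\epsilon_1^2$ but only logarithmically in $1/\delta_1$, which is why the labeled sample in this approach stays small relative to the unlabeled one.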

However, now we use the unlabeled sample to estimate the means, and get that the probability of either estimate being more than $\epsilon$ from its true value is at most $\delta/2$ when

N2 log2  \  (  i  i\ 

n  =  cx-^— (iVlog7  +  log-) 

and we select the bad labeling of the hyperplane partition with probability at most $\delta/2$ when m is at least

\[ \frac{1}{c_2 p + c_3}\,\log\frac{1}{\delta}. \]

The  analysis  follows  as  in  Section  4.4  to  obtain 

Perror  <  ^ PBayes  +  C4  ^ j  +  ^  +  (1  —  (1  —  p)™)  +  4 epe  2mp. 

So for arbitrary $0 < \alpha, \beta, \gamma$, in order to have an error

\[ P_{error} \le (P_{Bayes} + \alpha)\beta + \gamma \]

For a prespecified $P_{error}$, m is polynomial in $1/p$, and n is polynomial in $\log(1/p)$. For $p \to 0$, we can let $\alpha \to \infty$, and let both $\beta, \gamma \to 0$, such that $m \to 1$, $n \to 0$ and $P_{error} \to 0$. (Note that $P_{Bayes} \to 0$ as $p \to 0$, and we used the fact that the algorithm requires at least one labeled example.) For a fixed $0 < p < 1/2$, m depends on $\alpha$ as


and  n  depends  on  a,  N,  as 

iVMogi 

C9  6 


Hence n is polynomial in N and $1/\alpha$, while m depends only on $\alpha$ and is polynomial in $1/\alpha$.

This  is  further  discussed  in  Chapter  6.  Now  consider  the  second  approach. 


4.5.2  Learning  with  algorithm  M2 


Algorithm M2:

The setting: Two pattern classes with an underlying Gaussian mixture density

\[ f(x|\theta_0) = p\,g(x|\mu_{01}) + (1-p)\,g(x|\mu_{02}) \]

where $\theta_0 = [\mu_{01}, \mu_{02}, p] \in \Theta$, and $\Theta$ is a compact subset of $\mathbb{R}^{2N+1}$, such that the teacher draws labeled and unlabeled examples according to $f$ by choosing class “1” with probability $p$, class “2” with probability $1 - p$, and then drawing according to the selected class conditional density $g(x|\mu_{0i})$, $i = 1, 2$.

Given: The teacher draws $m = m_1 + m_2 > 0$ labeled examples.

Begin:

1) If $m_1 = 0$ then label all of $\mathbb{R}^N$ as “2”. If $m_2 = 0$ then label all of $\mathbb{R}^N$ as “1”. Go to End.

2) Otherwise request $n > 0$ unlabeled examples.

3) Estimate $\theta_0$ by

\[ \hat\theta = \operatorname{argsup}_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^{n} \log f(x_i|\theta). \]

4) Let the decision border be the hyperplane defined by

\[ h(x) = (x - \hat\mu_1, \hat\mu_2 - \hat\mu_1) - \frac{1}{2}|\hat\mu_2 - \hat\mu_1|^2 - \log\frac{\hat p}{1-\hat p} = 0. \]

5) Label the two decision regions across the hyperplane by the label of the majority of the examples in each region.

End.
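Steps 3 and 4 of algorithm M2 can be illustrated numerically. The sketch below is an assumption-laden illustration, not the procedure analyzed in the text: it uses the EM iteration as a stand-in for the exact argsup (EM only finds a local maximum of the likelihood), it restricts to one dimension (N = 1) with unit variances for brevity, and the names `em_two_gaussians` and `border` are hypothetical. The majority labeling of step 5 is not included.

```python
import math


def em_two_gaussians(xs, iters=100):
    """Approximate MLE of (mu1, mu2, p) for p*N(mu1,1) + (1-p)*N(mu2,1).

    EM is used here in place of the exact argsup; the starting point puts
    the two means at the extremes of the sample, so mu1 <= mu2 initially.
    """
    n = len(xs)
    mu1, mu2, p = min(xs), max(xs), 0.5
    for _ in range(iters):
        # E-step: posterior probability that each x came from component 1.
        resp = []
        for x in xs:
            d = 0.5 * ((x - mu1) ** 2 - (x - mu2) ** 2) + math.log((1.0 - p) / p)
            d = max(min(d, 700.0), -700.0)  # keep exp() within float range
            resp.append(1.0 / (1.0 + math.exp(d)))
        # M-step: responsibility-weighted means and mixing proportion.
        s = min(max(sum(resp), 1e-9), n - 1e-9)
        mu1 = sum(r * x for r, x in zip(resp, xs)) / s
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, xs)) / (n - s)
        p = s / n
    return mu1, mu2, p


def border(x, mu1, mu2, p):
    """One-dimensional h(x) of step 4; negative on the side of mu1."""
    return (x - mu1) * (mu2 - mu1) - 0.5 * (mu2 - mu1) ** 2 - math.log(p / (1.0 - p))
```

From unlabeled data alone the two components are identified only up to a swap, which is why the algorithm still needs labeled examples to attach class labels to the two sides of the border.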


We  proceed  as  in  the  previous  two  sections,  except  now  the  unlabeled  examples 
are  used  to  estimate  the  means  and  p,  while  the  labeled  examples  are  just  used  for 
picking  the  labeling.  Then  with  algorithm  M2  it  suffices  to  draw 


n  = 


Ci  TV2  log2  i 
e6S 


unlabeled examples to yield estimates of $p$ and of the two means, each $\epsilon$-close to its corresponding unknown parameter. (The constant $c_1$ here is larger than the $c_1$ in the expression for n in Section 4.5.1.) The decision rule based on these estimates has

\[ P_{error} \le P_{Bayes} + c_2\epsilon \]

when labeled with the optimal labeling, which is picked with confidence $> 1 - \delta$ when at least


m 


C3 

c4p  + 1 


labeled  examples  are  drawn. 
We obtain that for

\[ P_{error} \le (P_{Bayes} + \alpha)\beta + \gamma \]

the sufficient sample complexity m is


m  =  c5  max 


1  1  logiq;  1,  p 

c4p+  1  °g  a  ’  log  ’  p  °S  7 


and  the  sufficient  n  is 


n  =  c6 


N3  log2  ±  log  ^ 

a7pU 


for arbitrary $0 < \alpha, \beta, \gamma$. For a prespecified $P_{error}$, m is practically unaffected by $p$, and n is a polynomial in $1/p$. However, as in the preceding sections, for $p \to 0$ we can let $\alpha \to \infty$, and let both $\beta, \gamma \to 0$, such that $m \to 1$, $n \to 0$, and $P_{error} \to 0$ (where we used the fact that algorithm M2 draws at least one labeled example). For fixed $0 < p < 1/2$ but variable N and $\alpha$, m is constant, while


n  =  c7 


A^logi 


So n is affected by N and $\alpha$, growing polynomially in N and $1/\alpha$. This is further discussed in Chapter 6.


Chapter  5 

Nonparametric  Scenario 


In this chapter we study another approach to learning classification of two pattern classes. The approach is called kernel estimation, a nonparametric approach which assumes very little knowledge about the form of the distribution. This method can
be  used  with  both  a  purely  labeled  or  a  mixed  scenario.  When  learning  with  a 
purely  labeled  sample  one  would  estimate  each  of  the  nonparametric  class  conditional 
densities  and  then  use  the  estimates  for  defining  an  estimate  decision  rule  using 
the  Bayesian  approach  (Chapter  1).  This  falls  under  the  field  of  Nonparametric 
Discriminant  Analysis  (cf.  Silverman  [40]).  Theoretical  studies  of  nonparametric 
discrimination  techniques  indicate  that  they  yield  asymptotically  optimal  decision 
rules  (cf.  Silverman  [40]). 

We  are  interested  in  using  the  kernel  technique  with  the  mixed  sample  scenario. 
In  the  last  chapter,  we  found  the  unlabeled  and  labeled  sample  complexities  of  learn¬ 
ing  a  Gaussian  mixture.  It  is  interesting  to  ask  what  is  the  complexity  of  learning 
the  problem,  based  on  the  same  underlying  mixture,  with  the  nonparametric  kernel 
estimation. In this scenario the learner does not utilize the information that he had in Chapter 4, namely that $f$ can be indexed by a finite real vector and that the decision rule can be determined from the class conditional densities which can be inferred uniquely from the mixture estimate.

So investigating the complexity of a nonparametric method for such a family shows us how much more complexity, in terms of sample sizes, is needed when, for example, parametric side information is not available. We then show how to extend this approach to a large nonparametric class of distributions which includes the Gaussian mixtures.

One way to try to use the kernel method with unlabeled examples is by limiting consideration to a family of problems whose underlying mixture $f(x)$ is identifiable (not necessarily parametric) (cf. Cover & Castelli [5]). The family is chosen so that given $f(x)$, it is possible to uniquely determine its components, $f_1(x)$, $f_2(x)$, and the a priori probabilities $p_1$, $p_2$. The mixed sample can be used by some nonparametric method, say kernel estimation, to estimate $f(x)$ by $f_n(x)$ to within accuracy $\epsilon$ uniformly over $x$. Then, there exists an identifiable function $\tilde f(x)$, such that $|\tilde f(x) - f_n(x)| \le \epsilon$. By careful selection of the functions that are elements of this family, it may be possible to have the latter imply that the two corresponding components are close, i.e., $|\tilde p_i \tilde f_i - p_i f_i| \le 2\epsilon$, $i = 1, 2$. Then construct the decision rule based on $\tilde p_1$, $\tilde p_2$, $\tilde f_1$, $\tilde f_2$, which has $P_{error}$ close to $P_{Bayes}$. The goal would be to try to match the richness of this family with the power of the estimation technique, e.g., the kernel method has only a few restrictions on the types of $f$ that can be estimated and therefore it can handle a very rich family of functions. This way the large sample complexities (which we expect for a powerful estimation technique) will be justifiable for learning the family of functions that we defined above.

One difficulty in this approach is in finding the identifiable function $\tilde f$ from the estimate $f_n$, especially if the identifiable family of mixtures is nonparametric. This translates into difficulty in finding the decision rule estimate. While in principle (with an appeal to the continuum hypothesis) it may be possible to order all functions in this family and then search for any function that is $\epsilon$-close to $f_n$ (we know that there exists at least one, namely the unknown true mixture $f$), it is not clear that a practical polynomial-time algorithm can be constructed to find such an $\tilde f$.

[Figure 5.1]

Consider for a moment a Gaussian mixture (Figure 5.1). It appears that the modes (i.e., the two global maxima) of this one-dimensional mixture may determine the Bayes border. In Lemma 5.5 we show that this is true for the N-dimensional mixture, under the condition that the means of the class conditional densities are a certain distance apart. This suggests that there may be another approach using the kernel technique for constructing a classifier. There may be an algorithm that first estimates the mixture $f$ by $f_n(x)$ (where the subscript n shows the dependence on the sample) to within $\epsilon$-accuracy, then determines consistent estimates, $\hat\eta_1$, $\hat\eta_2$, of the modes $\eta_1$, $\eta_2$ of $f$ using $f_n(x)$. This method would be categorized as pseudo-direct because it skips the estimation of the class conditional densities; however, it still uses density estimation for the mixture. Of course, this approach can be used for a large generic class of nonparametric problems whose Bayes border is determined by modes. We pursue this approach here.

To begin with, we will consider an algorithm, called algorithm K, which follows the above intuition, and learns the classification problem with an underlying Gaussian mixture. We then present a description of a rich nonparametric family of mixtures that contains the Gaussian mixture, in addition to other types of mixtures which algorithm K can handle. For every mixture $f$ in this family, the Bayes border is a linear hyperplane and is identified by the modes of $f$.

In this nonparametric scenario, if we view the problem of finding the best estimate $f_n(x)$ for $f(x)$ as a learning problem in the generalized framework of PAC (cf. [12]) then we expect that a larger sample is needed to learn $f(x)$ (compared to the parametric scenario of Chapter 4) since the class of functions of which $f_n(x)$, $f(x)$ are members is significantly richer. Our analysis of kernel estimation uses the principle of the uniform SLLN differently from the PAC framework, i.e., there is no overall empirical loss which is minimized in order to learn $f(x)$. Nonetheless a quantity called the VC-dimension (see Chapter 3), which bounds the covering number through the expression of (3.8), will emerge as a clear indication of the complexity that is involved in learning $f(x)$ when no parametric knowledge is assumed about its form. In Chapter 4 we saw that the complexity of the problem of learning a Gaussian mixture as an element of the parametric class of Gaussian mixtures was reflected in the covering number of a related class (the class Q on page 63). So the covering number plays an important role in realizing the difference in complexities of learning a particular function $f(x)$ both in the parametric and nonparametric scenarios.

The results of Chapter 5 indicate that for the nonparametric scenario a sufficient
unlabeled  sample  for  learning  the  Gaussian  mixture  problem  is  significantly  larger 
than  in  the  parametric  scenario  of  Chapter  4.  The  labeled  sample  size  is  the  same, 
hence  the  value  of  a  labeled  example  is  significantly  higher  when  less  side  informa¬ 
tion  is  at  hand.  This  is  discussed  further  in  Chapter  6  where  we  discourse  on  the 
measurement  of  this  value. 

The material of this chapter is organized as follows: In Section 5.1 we provide a review of kernel estimation, which contains the important points that we need in later sections. In Section 5.2 we investigate learning the classification problem based on a Gaussian mixture, using algorithm K. The main result of this section is embodied in Theorem 5.1 which gives the mixed sample complexity for learning with algorithm K. Before proving the theorem in Section 5.4 we state several lemmas in Section 5.3.

In  Section  5.5  we  consider  learning  using  algorithm  K  (with  the  same  sample 
complexities  stated  in  Theorem  5.1)  a  classification  problem  which  is  based  on  pat¬ 
tern  class  densities  whose  mixture  /  is  of  a  more  general  type.  We  describe  the 
nonparametric  family,  and  then  prove  the  consistency  of  the  mode  estimates  which 
are  constructed  by  algorithm  K. 

In Section 5.6 we present an alternate nonparametric technique, learning vector quantization in neural networks, which can use unlabeled examples and labeled examples in learning a classification rule by exploiting clustering. In Section 5.7 we analyze a learning-classification approach, called k-means, which is based on minimizing the empirical MSE of the Voronoi partition on which the classifier is based. We find the labeled and unlabeled sample complexities for learning a general decision problem having well-separated pattern classes.

We now discuss the principal ideas behind the kernel technique (cf. Silverman [40], Duda & Hart [1]).

5.1  Kernel  Density  Estimation:  A  Review 

The naive one-dimensional kernel density estimate is based on the idea of placing a $\sigma$-scaled version of the window function

\[ w(x) = \begin{cases} 1 & |x| \le \frac{1}{2} \\ 0 & \text{otherwise} \end{cases} \]

around the test point $x$, then counting the number of examples $x_i$ that fall inside the window, and normalizing by the total number n of examples and the window width $\sigma$. The value of the estimate $f_n(x)$ at $x$ is therefore expressed as

\[ f_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{\sigma}\, w\left(\frac{x - x_i}{\sigma}\right). \]

This estimate, however, does not possess some important properties, such as smoothness, that will be discussed later. We hence introduce a more general estimate which depends on a function $K(x)$ called the kernel, that satisfies

\[ \int_{-\infty}^{\infty} K(x)\, dx = 1. \]

That is,

\[ f_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{\sigma}\, K\left(\frac{x - x_i}{\sigma}\right). \]

The estimate $f_n(x)$ is still defined using the same equation as for the window function. However, now it can be viewed as a sum of “humps” centered at the examples $x_i$ (see Figure 5.2). The kernel function $K$ determines the shape of the humps and the parameter $\sigma$ determines their effective width.

[Figure 5.2]

As $\sigma$ goes to 0 the estimate is a sum of delta functions at the examples. Such an estimate does not interpolate the data at all. At the other extreme, as $\sigma$ tends to infinity, all the humps overlap and sum up to a very smooth function that hides all the high frequency detail of the underlying density.
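The effect of the kernel shape and of the width $\sigma$ is easy to see numerically. The following is a minimal one-dimensional sketch of the estimate $\frac{1}{n}\sum_i \frac{1}{\sigma} K((x - x_i)/\sigma)$; the function names are hypothetical, and the Gaussian kernel is included only as one familiar smooth choice, not as the kernel used later in this chapter.

```python
import math


def window(u: float) -> float:
    """Naive window function: 1 on [-1/2, 1/2], 0 elsewhere."""
    return 1.0 if abs(u) <= 0.5 else 0.0


def gaussian_kernel(u: float) -> float:
    """A smooth kernel that integrates to 1 (illustrative choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_estimate(x: float, sample, sigma: float, kernel) -> float:
    """Value at x of the kernel estimate built from `sample` with width sigma."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    total = sum(kernel((x - xi) / sigma) for xi in sample)
    return total / (len(sample) * sigma)
```

Evaluating `kernel_estimate` on a grid for a very small and a very large `sigma` reproduces the two extremes described above: isolated spikes at the examples in the first case and a single flat hump in the second.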

There are several ways to measure the goodness of the kernel estimator. Viewing $f_n(x)$ as a random variable, because of its dependence on the random sample, admits the mean square error (MSE) as a measure of error, i.e.,

\[ E\left(f_n(x) - f(x)\right)^2 \]

with expectation w.r.t. the joint density of the examples $x_1, \ldots, x_n$. (Here $x$ is nonrandom and all the randomness resides in the examples.) The MSE can be written as a sum of two terms:

\[ \left(E f_n(x) - f(x)\right)^2 + \mathrm{var}(f_n(x)) = \mathrm{bias}^2(f_n(x)) + \mathrm{var}(f_n(x)). \]

We could also view $f_n(x)$ as a regular function and hence use the sup norm as a measure of discrepancy,

\[ \sup_x |f_n(x) - f(x)| \]

which itself is a random variable (as $f_n$ is random). So one may define the error measure as

\[ P\left(\sup_x |f_n(x) - f(x)| > \epsilon\right). \]

The event

\[ \left\{\sup_x |f_n(x) - f(x)| > \epsilon\right\} \]

implies that

\[ \left\{\sup_x |f_n(x) - \bar f(x)| > \frac{\epsilon}{2}\right\} \quad \text{or} \quad \left\{\sup_x |\bar f(x) - f(x)| > \frac{\epsilon}{2}\right\} \]

where

\[ \bar f(x) = E f_n(x). \]

The first term is analogous to the variance and corresponds to a random event. The second term is the magnitude of the bias of $f_n(x)$ and is a deterministic event. For a good estimate we demand that the probability of the event

\[ \sup_x |f_n(x) - f(x)| > \epsilon \]

be less than $\delta$ where $\delta > 0$ is chosen suitably small. This amounts to demanding that the bias term

\[ \sup_x |\bar f(x) - f(x)| \le \frac{\epsilon}{2} \]

with probability 1 (i.e., it is a deterministic event) and that

\[ P\left(\sup_x |f_n(x) - \bar f(x)| > \frac{\epsilon}{2}\right) \le \delta. \]

We will refer to these two components as the random and the bias parts of the error. This error measurement will be used in this chapter because it fits nicely in the framework of uniform SLLN convergence which was introduced in Chapter 3.

We now review some well known properties of these two components of the error. We limit the discussion to the univariate case. Consider the bias part. Since the examples are identically and independently distributed, we have

\[ \bar f(x) = E f_n(x) = \frac{1}{\sigma}\int f(y)\, K\left(\frac{y - x}{\sigma}\right) dy. \]

Using the fact that by choice $K(x)$ integrates to 1, we have

\[ \bar f(x) - f(x) = \frac{1}{\sigma}\int f(y)\, K\left(\frac{y - x}{\sigma}\right) dy - \int f(x) K(z)\, dz = \int f(x + \sigma z) K(z)\, dz - \int f(x) K(z)\, dz = \int K(z)\left(f(x + \sigma z) - f(x)\right) dz. \]

Now, expand in Taylor series around the point $\sigma = 0$ to obtain

\[ f(x + \sigma z) = f(x) + \sigma z f'(x) + \frac{1}{2}\sigma^2 z^2 f''(x) + \cdots, \]

whence

\[ \bar f(x) - f(x) = \sigma f'(x)\int z K(z)\, dz + \frac{\sigma^2}{2} f''(x)\int z^2 K(z)\, dz + \cdots. \]

If we let $\sigma$ decrease to 0 with increasing n then the bias goes to zero. The rate of decrease can be made faster by choosing $K(x)$ such that its first r moments are identically 0, i.e.,

\[ \int K(x)\, x^i\, dx = 0, \quad 1 \le i \le r. \]

Such a choice of kernel guarantees that the first r terms of the bias are 0 so that the expression for the bias becomes

\[ \frac{\sigma^{r+1}}{(r+1)!}\, f^{(r+1)}(x)\int K(z)\, z^{r+1}\, dz + o(\sigma^{r+1}), \quad \sigma \to 0. \]

However, if $K(x)$ is to have zero moments of order $\ge 2$ then it must take negative as well as positive values. The estimate $f_n(x)$ may itself be negative at some points. This is not acceptable if the estimate is to be a density. In our case we will use $f_n(x)$ to estimate only the modes of $f(x)$, allowing $f_n(x)$ to be negative in places if need be.
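A concrete kernel with vanishing low-order moments makes the trade-off visible. The sketch below checks, in exact rational arithmetic, that the polynomial $K(x) = (9 - 15x^2)/8$ on $[-1, 1]$ integrates to 1 and has zero first, second and third moments, while taking negative values near the ends of its support. This particular polynomial is an illustrative choice and is not taken from the text; the helper names are hypothetical.

```python
from fractions import Fraction


def moment(coeffs, j: int) -> Fraction:
    """Exact integral over [-1, 1] of x**j * sum_i coeffs[i] * x**i.

    Uses the fact that the integral of x**n over [-1, 1] is 2/(n+1)
    for even n and 0 for odd n.
    """
    total = Fraction(0)
    for i, a in enumerate(coeffs):
        n = i + j
        if n % 2 == 0:
            total += Fraction(a) * Fraction(2, n + 1)
    return total


def kernel_value(coeffs, x) -> Fraction:
    """Value of the polynomial kernel at x, and 0 outside [-1, 1]."""
    x = Fraction(x)
    if abs(x) > 1:
        return Fraction(0)
    return sum((Fraction(a) * x ** i for i, a in enumerate(coeffs)), Fraction(0))


# Coefficients of K(x) = 9/8 - (15/8) x^2, lowest degree first.
EXAMPLE_KERNEL = [Fraction(9, 8), Fraction(0), Fraction(-15, 8)]
```

With this kernel the leading bias term is of order $\sigma^4$ rather than $\sigma^2$, at the price of an estimate that is no longer guaranteed to be nonnegative.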
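Now let us consider the random part of the error, namely

\[ P\left(\sup_x |f_n(x) - \bar f(x)| > \frac{\epsilon}{2}\right). \]

By the definition of $f_n(x)$ this can be written as

\[ P\left(\sup_{K_x \in \mathcal{K}_\sigma} \left|\frac{1}{n}\sum_{i=1}^{n} K_x(x_i) - E K_x\right| > \frac{\epsilon\sigma}{2}\right) \]

where $\mathcal{K}_\sigma$ is a class of functions indexed by the scalar $x$, i.e.,

\[ \mathcal{K}_\sigma = \left\{K_x(y) = K\left(\frac{x - y}{\sigma}\right) : x \in \mathbb{R}\right\}. \]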

Theorem  3.10  bounds  this  probability  by  a  quantity  of  the  form 

1  ,  1  _r 


log  —  ) 
ter  ta  / 


=  8 
where we ignored some of the constants which are irrelevant to our discussion here and we used the fact that 82 in the theorem is proportional to $\sigma$ if the underlying distribution can be uniformly bounded and $K$ is square integrable. Thus the confidence of getting an $\epsilon$-accurate estimate $f_n(x)$ is at least $1 - \delta$.

The above discussion identifies two critical variables that influence whether this bound is small, and how fast it decreases with n. First, the choice of $\sigma$. The bound on the probability of the random error term is reduced as $\sigma$ increases. That means we can make the accuracy parameter $\epsilon$ smaller while keeping the confidence $\delta$ the same, i.e., the random error term is reduced as $\sigma$ increases. The intuition behind it is that as the “window size” $\sigma$ gets larger, the variance of the estimate $f_n(x)$ decreases. However, as we noted before, the bias increases as $\sigma$ increases. This conflict of interest demonstrates one of the fundamental problems of density estimation.

The second conflict of interest appears when trying to shape the kernel $K(x)$ so as to have a bias that depends on high order terms of $\sigma$ (as noted above). As we saw, the bias can be made to decrease faster if we choose $K(x)$ which is orthogonal to $x^i$, $1 \le i \le r$. However, as will be shown in the proof of Theorem 5.1, one class $\mathcal{K}_\sigma$ of such kernels exhibits a VC-dimension which increases as r increases. As a result, the bound on the random part of the error increases, unless we increase the error deviation $\epsilon$ of the estimator in order to keep the same confidence $1 - \delta$. This is obviously not desirable. So as we try to exhibit a shape for the kernel that has the first r moments identically zero in order to reduce the bias at a faster rate w.r.t. n, there is an adverse effect, coming from the random part of the error through the increase in $VC(\mathcal{K}_\sigma)$. The intuition behind this is that a kernel function having more zero moments is likely to have more relative maxima and heavier negative parts over its fixed support, in which case $\sup_{x,y:\,|x-y| \le \sigma} |K(x) - K(y)| = M$ increases. It can be shown by an argument as in Lemma 5.6 that the covering number of the class of functions increases as M increases. A higher covering number implies a higher VC-dimension (by Theorem 3.8) which agrees with the above.

With some algebra it can be shown that the fastest possible decrease of $\sigma$ w.r.t. n, while keeping a tight bound on the random component of the error, is such that $\sigma/(\log n/n) \to \infty$ as $n \to \infty$ (cf. Silverman, B.W. [41], Stute, W. [48]). In fact, this yields a bound that decreases fast enough to achieve uniform a.e. convergence of $\sup_x |f_n(x) - f(x)|$ to 0. In choosing the kernel one minimizes the bound on the random part w.r.t. r.


5.2  Gaussian  Mixture 


In this section we consider the mixed sample complexities for learning the classification decision rule for a problem whose pattern classes are distributed as N-dimensional Gaussians with unit covariances where the learner, in the absence of specific parametric side-information about the class, opts for a kernel estimation approach.

The modes of the Gaussian mixture determine the Bayes decision whenever the mixture has two modes, which holds if the means satisfy $|\theta_{01} - \theta_{02}| > 2$. This, together with the fact that the Bayes decision border is the hyperplane equidistant from the modes and perpendicular to the line $\eta_2 - \eta_1$, is shown in Lemma 5.5. (Note, the modes of the mixture do not equal the means of the class conditional densities.)

Algorithm K (shown below) is used to construct the decision border by first using the unlabeled sample to estimate the mixture $f(x)$ by the kernel estimate $f_n(x)$. Then two modes $\hat\eta_1$, $\hat\eta_2$ of $f_n(x)$ are obtained such that they are consistent estimates of the two modes of the mixture $f$. The intuition here is that for sufficiently small accuracy $\epsilon > 0$ the main humps of $f_n$ capture the modes $\eta_1$, $\eta_2$ of $f$. Hence the value $\hat\eta_i$ at which $f_n$ is maximized over the ith hump is a consistent estimate of $\eta_i$.

Using the same analysis as for the MLE in Chapter 4, the decision regions separated by the hyperplane between the mode estimates yield close-to-Bayes misclassification error when labeled with the good labeling (as in the MLE we also have $L_{good}$, $L_{bad}$ and use the same number, m, of labeled examples to choose $L_{good}$ with some confidence).

We first state the algorithm, then we state the theorem, followed by the proof. (Algorithm K can be used to learn the Gaussian-mixture based problem as well as the larger class of mixtures of general form which is discussed in Section 5.5. In order not to break the flow, we only state the algorithm here while its discussion is delayed to Section 5.5. For the Gaussian mixture problem we will show the construction of the mode estimates in the proof of Theorem 4.2, hence for the moment it suffices to focus only on the main part of the algorithm without procedure P, which describes the construction of the mode estimates for the more general case.)
Algorithm K:

The setting: $f(x)$ has global modes, $\eta_i$, $1 \le i \le k$, where $k \ge 2$.

Given: n unlabeled examples, m labeled examples drawn randomly according to the unknown $f(x)$.

Begin:

1) Use kernel estimation to obtain $f_n(x)$.

2) Use procedure P to determine the mode estimates, $\hat\eta_i$, $1 \le i \le k$.

3) Use the mode estimates to construct a decision border as the hyperplane which passes through the point $\bar\eta$,

\[ \bar\eta = \frac{1}{k}\sum_{i=1}^{k} \hat\eta_i \]

and which is perpendicular to the straight line which is the closest (in the mean-squared-error sense) to $\hat\eta_i$, $1 \le i \le k$. (Note, in the Gaussian mixture case $k = 2$, hence this step produces the hyperplane which is perpendicular to the line through $\hat\eta_1$ and $\hat\eta_2$.)

4) Label the two decision regions across the hyperplane by the label of the majority of the examples in each region.

End.
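Steps 3 and 4 of algorithm K can be sketched for the case $k = 2$, where the border is the hyperplane through the midpoint of the two mode estimates and perpendicular to the line joining them. The code below is only an illustration under that assumption: it takes the mode estimates as given (procedure P is not implemented), represents points as sequences of floats, and uses hypothetical function names.

```python
from collections import Counter


def border_from_modes(eta1, eta2):
    """Midpoint and normal vector of the hyperplane between two mode estimates."""
    center = [(a + b) / 2.0 for a, b in zip(eta1, eta2)]
    normal = [b - a for a, b in zip(eta1, eta2)]
    return center, normal


def region(x, center, normal) -> int:
    """Index of the decision region containing x: 1 on the eta2 side, else 0."""
    s = sum(n * (xi - c) for xi, c, n in zip(x, center, normal))
    return 1 if s > 0 else 0


def label_regions(labeled, center, normal):
    """Majority label of the labeled examples falling in each region.

    `labeled` is a list of (point, label) pairs; a region that receives no
    labeled example is mapped to None.
    """
    votes = {0: Counter(), 1: Counter()}
    for x, y in labeled:
        votes[region(x, center, normal)][y] += 1
    return {r: (c.most_common(1)[0][0] if c else None) for r, c in votes.items()}


def classify(x, center, normal, labels):
    """Label assigned to x by the labeled hyperplane partition."""
    return labels[region(x, center, normal)]
```

Only the relative position of the labeled examples with respect to the border matters in step 4, which is consistent with the labeled sample being used solely to pick between the two possible labelings of the partition.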
Procedure P: Definitions:


•  M  =  1  <  i  <  k. 

D  =  {.T  :  f(x]  uViiX)  =  0,  x  ±  T/j,  =  M,  =  M,  1  <  i,j  <  k} 
L  =  sup  f(x) 

xeD 

where  f'(x]  um  x)  is  the  directional  derivative  of  /  at  x  in  the  direction 
of  the  unit  vector  uniiX  whose  direction  is  the  same  as  the  ray  starting 
at  i]i  going  through  x. 

•  Choose  e  <  — f — . 

•  7/i  =  argsup^eIRN/n(x). 

•  Be  =  {.r  :  fn(x)  >  fnifitx )  -  4e}. 


=  {y  :  \y  ~  i?i|  <  inf  inf  \z  -  ^|,  fn(z)  <  /„(t/i)  -  6e}  U  {t?,}. 

for  1  <  i  <  k ,  where  r^t|I  is  a  ray  from  fji  going  through  x. 

•  $Y = B_\epsilon - A_1$.

•  $i = 2$.


Do While: $Y \ne \emptyset$

1) $\hat\eta_i = \operatorname{argsup}_{x \in Y} f_n(x)$.

2) $Y = Y - A_i$.

3) $i = i + 1$.

End Do.


End  Procedure . 

We  now  state  the  theorem,  followed  by  its  preview  and  proof. 
Theorem 5.1  Suppose we are given two classes which are distributed according to Gaussian probability densities $f_1(x)$, $f_2(x)$, with means $\theta_{01}$ and $\theta_{02}$ respectively, and with unit covariance matrices. Suppose further that $0 < P_{Bayes} < 0.16$ (or, equivalently, $|\theta_{01} - \theta_{02}| > 2$). Then there exists a positive constant b (determined by $|\theta_{01} - \theta_{02}|$) such that for $0 < \epsilon < b$ and arbitrary $\delta > 0$, given

^2^Vlog(5+log  N)  ^ 

n  =  Cl - — — - log -7 

£210  gJV  €0 


unlabeled examples and

\[ m = c_2 \log\frac{1}{\delta} \]

labeled examples, algorithm K determines a decision rule with a classification error

\[ P_{error}(\hat\eta_1, \hat\eta_2) \le P_{Bayes}(1 + c_3\epsilon) \]

with confidence at least $1 - \delta$, where $c_1 > 0$ is an absolute constant, $c_2 > 0$ is a constant depending on $P_{Bayes}$, and $c_3 > 0$ depends on $\theta_0$.


Remark:  The restriction on $P_{Bayes}$ is a consequence of the constraint on the two means of the class conditional densities to be sufficiently distant in order for the mixture to have two modes and thereby identify the Bayes border.
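The numerical threshold in the theorem can be checked directly. For two unit-covariance Gaussian classes whose means are a distance d apart, the Bayes error with equal priors is $\Phi(-d/2)$, so $d = 2$ gives $\Phi(-1) \approx 0.159$. The sketch below evaluates this, together with the standard two-class expression for a general prior $p$; the equal-prior reading of the threshold is an assumption made here for illustration, and the function names are hypothetical.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bayes_error(d: float, p: float = 0.5) -> float:
    """Bayes error for two unit-covariance Gaussians at distance d.

    Class "1" has prior p and class "2" has prior 1 - p. The likelihood
    ratio test shifts the threshold away from the midpoint by
    (1/d) * log((1 - p) / p).
    """
    if d <= 0.0 or not (0.0 < p < 1.0):
        raise ValueError("need d > 0 and 0 < p < 1")
    shift = math.log((1.0 - p) / p) / d
    return (1.0 - p) * std_normal_cdf(-d / 2.0 - shift) + p * std_normal_cdf(-d / 2.0 + shift)
```

Increasing d beyond 2 pushes the equal-prior Bayes error below this value, in line with the direction of the restriction stated in the theorem.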


We  now  provide  a  preview  of  the  proof. 


5.2.1  Preview  of  the  proof  of  Theorem  5.1 


As mentioned at the start of Chapter 5, our nonparametric approach here is kernel estimation which utilizes the unlabeled sample to estimate the mixture $f(x)$ by the estimate

\[ f_n(x) = \frac{1}{n}\sum_{i=1}^{n} \sigma^{-N} K\left(\frac{x - \xi_i}{\sigma}\right) \tag{5.1} \]

where we use here the notation $\xi_1, \ldots, \xi_n$ to represent the randomly drawn unlabeled sample of size n. We denote by $x$ a vector in $\mathbb{R}^N$, and $x_i$, $1 \le i \le N$, denotes its components. We will use the particular kernel function defined by the polynomial over the interval $[-1, 1]$ as

\[ K_1(x_1) = \begin{cases} \sum_{i=0}^{r} a_i x_1^i & |x_1| \le 1, \\ 0 & \text{otherwise,} \end{cases} \]

where $x_1^i$ denotes the variable $x_1$ raised to the ith power, with r an even integer and with coefficients $a_i$ selected so that $K_1(x_1)$ is orthogonal to $x_1^i$ for $1 \le i \le r - 1$ and such that $\int K_1(x_1)\, dx_1 = 1$. (The subscript in $K_1$ denotes that it is a one-dimensional kernel.) The N-dimensional kernel used in (5.1) is defined as the product

\[ K(y) = K_1(y_1) K_1(y_2) \cdots K_1(y_N), \quad y = [y_1, \ldots, y_N]. \]


As  mentioned  before,  having  r  —  1  zero  moments  helps  in  the  reduction  of  the  bias. 
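The construction of the one dimensional kernel can be illustrated numerically. The following is a minimal sketch, assuming NumPy is available; the function names `kernel_coefficients`, `kernel_1d` and `kernel_moments` are hypothetical and not part of the original text. It solves the moment conditions for the coefficients $a_i$ and checks that $K_1$ integrates to one while its moments of order $1, \ldots, r-1$ vanish.

```python
import numpy as np

def kernel_coefficients(r):
    """Coefficients a_0..a_{r-1} of K_1, found by solving A a = e_0."""
    if r < 2 or r % 2 != 0:
        raise ValueError("r must be an even integer >= 2")
    idx = np.arange(r)
    s = idx[:, None] + idx[None, :]           # exponent i + j
    # A[i, j] = integral of x^(i+j) over [-1, 1]: 2/(i+j+1) if i+j is even, else 0
    A = np.where(s % 2 == 0, 2.0 / (s + 1.0), 0.0)
    e0 = np.zeros(r)
    e0[0] = 1.0                                # zeroth moment is 1, the others vanish
    return np.linalg.solve(A, e0)

def kernel_1d(x, a):
    """Evaluate K_1(x) = sum_i a_i x^i on [-1, 1] and zero elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1.0, np.polyval(a[::-1], x), 0.0)

def kernel_moments(a, n_nodes=64):
    """Moments of K_1 of order 0..r-1, computed by Gauss-Legendre quadrature."""
    nodes, weights = np.polynomial.legendre.leggauss(n_nodes)
    k = kernel_1d(nodes, a)
    return np.array([np.sum(weights * k * nodes**j) for j in range(len(a))])

a = kernel_coefficients(4)      # r = 4 is an arbitrary illustrative choice
moments = kernel_moments(a)
print("coefficients:", a)
print("moments:", moments)
```

For $r = 4$ the odd coefficients vanish, so the kernel is an even polynomial that takes negative values near the ends of $[-1, 1]$.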

The estimation discrepancy will be defined as the worst case (over all $x$) deviation
of $f_n(x)$ from $f(x)$ divided by the supremum of $f(x)$. This allows us to compare
the performance as a function of dimensionality $N$, i.e., fix an $\epsilon$-accuracy between
$P_{\text{error}}$ and $P_{\text{Bayes}}$ to hold for all dimensions, and compare the sample complexities for
different $N$ with this fixed error criterion. As before, the total error is split into the
bias and the random components, where $\bar f(x)$ denotes the expected value of $f_n(x)$:

$$\frac{\sup_x |f_n(x) - f(x)|}{\sup_x f(x)} \le \frac{\sup_x |\bar f(x) - f(x)|}{\sup_x f(x)} + \frac{\sup_x |f_n(x) - \bar f(x)|}{\sup_x f(x)}.$$

Because we are interested in estimating the unknown distribution $f(x)$ uniformly
for all $x \in \mathbb{R}^N$ by the estimate $f_n(x)$, the aim will be to transform the random
component of the error partly into a uniform SLLN convergence over a class $\mathcal{K}_\sigma$ of
bounded functions using the same truncation ideas as before. Each function in $\mathcal{K}_\sigma$ is
indexed by a point $x \in D \subset \mathbb{R}^N$, where $D$ is a suitably chosen compact set, and hence
uniformly approximating this class of functions gives the uniform approximation of
$f(x)$ over $D$. The uniform SLLN cannot be applied over $D^c$ as it is a noncompact
region. However, we can choose $D$ so that the magnitudes of $f(x)$ and $f_n(x)$ on $D^c$, and hence
their difference, are sufficiently small to obtain a good estimate $f_n(x)$ uniformly for
$x \in \mathbb{R}^N$.
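The discrepancy measure above can be tried out in one dimension. The following is a minimal sketch, assuming NumPy; the names `kde_1d` and `mixture_pdf`, the sample size, the window $\sigma = 0.5$, the means $\pm 1.5$ and the choice $r = 4$ (whose kernel coefficients are hard-coded) are all illustrative assumptions rather than values from the text. It draws an unlabeled sample from a two-component Gaussian mixture, forms the kernel estimate of (5.1), and reports the sup deviation divided by $\sup_x f(x)$ on a grid.

```python
import numpy as np

# Coefficients a_0..a_3 of the one dimensional kernel for r = 4 (assumed choice).
A4 = np.array([9.0 / 8.0, 0.0, -15.0 / 8.0, 0.0])

def kde_1d(x, sample, sigma, a=A4):
    """f_n(x) = (1/n) sum_i sigma^{-1} K_1((x - zeta_i) / sigma)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    u = (x[:, None] - sample[None, :]) / sigma
    k = np.where(np.abs(u) <= 1.0, np.polyval(a[::-1], u), 0.0)
    return k.mean(axis=1) / sigma

def mixture_pdf(x, theta1, theta2):
    """Equal-weight mixture of two unit-variance Gaussian densities."""
    g = lambda t: np.exp(-0.5 * (x - t) ** 2) / np.sqrt(2.0 * np.pi)
    return 0.5 * g(theta1) + 0.5 * g(theta2)

rng = np.random.default_rng(0)
n, theta1, theta2, sigma = 10_000, -1.5, 1.5, 0.5
labels = rng.integers(0, 2, size=n)                  # hidden class labels
sample = rng.normal(np.where(labels == 0, theta1, theta2), 1.0)

grid = np.linspace(-6.0, 6.0, 241)
f_true = mixture_pdf(grid, theta1, theta2)
f_hat = kde_1d(grid, sample, sigma)
rel_sup_error = np.max(np.abs(f_hat - f_true)) / np.max(f_true)
print("relative sup error on the grid:", rel_sup_error)
```

The grid maximum only approximates the supremum over the whole line; it is used here purely for illustration.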

We first proceed to determine a bound on the bias component. We show that
there exists a coefficient vector $a$ which satisfies the above orthogonality conditions
on the kernel $K_1$. Lemma 5.3 is used to bound the magnitude from above.

This bound is needed for both the bias and the random parts.

The bias is then expanded in a Taylor series with remainder around $\sigma = 0$ which
yields a polynomial in $\sigma$ whose coefficients depend on the first $r + 1$ derivatives of
the unknown mixture $f(x)$, which is a Gaussian mixture in this case, and also on the
first $r$ moments of the one dimensional kernel. Using the orthogonality of $K_1$ we are
left with one term which depends on a bound of $f^{(r)}(x)$ and on the $r$th moment of
$K_1$. The former is bounded using the theory of Hermite polynomials, see Lemma 5.4,
and the latter is bounded using the results of Lemma 5.3.

The random part of the error is bounded using the uniform SLLN (Theorem
3.10). Adapting an approach from Pollard [21], we view the estimate $f_n(x)$ as the
empirical mean of a class of $N$-dimensional functions $K_{\sigma,x}(y) = K\left(\frac{x - y}{\sigma}\right)$, each
indexed by a “parameter” $x \in D \subset \mathbb{R}^N$. The uniformity that is sought for the
estimate $f_n(x)$ over all $x \in \mathbb{R}^N$ is achieved by invoking the uniform SLLN theorem
over the class $\mathcal{K}_\sigma$. The reason that Theorem 3.10 and not Theorem 3.9 was used here
is the $1/\sigma^N$ factor, which is a part of $f_n(x)$ and is allowed to increase as the
sample size $n$ is increased.

We cannot let $K_{\sigma,x}(y)$ be defined simply as $\sigma^{-N} K\left(\frac{x - y}{\sigma}\right)$ because this would make
the magnitude of the functions $K_{\sigma,x}$ in the class $\mathcal{K}_\sigma$ depend on $n$; in particular, the
magnitude of $K_{\sigma,x}$ would increase with $n$.

The bounds on the random part of the error involve the quantity $VC(\mathcal{K}_\sigma)$. This is
evaluated easily by Theorem 3.6 together with Definitions 3.4, 3.5. We have chosen the
one dimensional kernel to be an $(r-1)$th degree polynomial, i.e., a linear combination
of basis functions, specifically in order to be able to apply Theorem 3.6 and yield a
bound for $VC(\mathcal{K}_\sigma)$.

Finally,  we  combine  the  bounds  on  the  two  error  parts  and  deduce  the  finite 
unlabeled  sample  complexity. 

With that accomplished, we then describe the learning procedure (which is based
on algorithm K, described in detail in Section 5.5) and show how the modes of
$f_n(x)$ yield consistent estimates of the modes of $f(x)$. Using the same analysis as in
the parametric cases of Chapter 4 we use the small labeled sample with the majority
rule to confidently pick the good labeling of the partition.
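The last step can be pictured with a short sketch. It is only an illustration under stated assumptions: the two mode estimates are taken as given, the partition is the hyperplane through their midpoint with normal along their difference, and the helper name `decision_rule_from_modes` is hypothetical. A small labeled sample then decides, by majority vote, which of the two possible labelings of the half-spaces to use.

```python
import numpy as np

def decision_rule_from_modes(eta_a, eta_b, x_labeled, y_labeled):
    """Return h(x) in {0, 1}: split by the hyperplane bisecting the two modes,
    with the labeling of the two sides chosen by majority vote."""
    eta_a = np.asarray(eta_a, dtype=float)
    eta_b = np.asarray(eta_b, dtype=float)
    normal = eta_b - eta_a                     # direction of the line through the modes
    midpoint = 0.5 * (eta_a + eta_b)           # the hyperplane passes through this point
    x_labeled = np.asarray(x_labeled, dtype=float)
    y_labeled = np.asarray(y_labeled)

    on_b_side = (x_labeled - midpoint) @ normal > 0
    # Count labeled examples consistent with "the eta_b side is class 1".
    votes = np.sum(on_b_side == (y_labeled == 1))
    b_is_class_1 = 2 * votes >= len(y_labeled)

    def h(x):
        side = (np.atleast_2d(np.asarray(x, dtype=float)) - midpoint) @ normal > 0
        return np.where(side == b_is_class_1, 1, 0)

    return h

# Illustrative data: modes near (-1.5, 0) and (1.5, 0), five labeled examples.
x_lab = [[-1.2, 0.3], [-2.0, -0.1], [1.7, 0.4], [0.9, -0.6], [0.2, 1.0]]
y_lab = [0, 0, 1, 1, 0]
h = decision_rule_from_modes([-1.5, 0.0], [1.5, 0.0], x_lab, y_lab)
h_swapped = decision_rule_from_modes([1.5, 0.0], [-1.5, 0.0], x_lab, y_lab)
print(h([[2.0, 0.0], [-2.0, 0.0]]), h_swapped([[2.0, 0.0], [-2.0, 0.0]]))
```

The rule does not depend on the order in which the two modes are passed, since the vote resolves the labeling either way.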

We  first  collect  auxiliary  lemmas  needed  in  the  proof  of  the  theorem  to  avoid 
breaking  up  the  flow  subsequently. 

5.3  Auxiliary  Lemmas 

The  following  lemma  can  be  found  in  Szego  &  Polya  [25,  page  89]. 

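Lemma 5.2 For any arbitrary polynomial $P(x) = \sum_{i=0}^{r} a_i x^i$ with real coefficients $a_i$
such that $\int_{-1}^{1} (P(x))^2\,dx = 1$, for $-1 \le x \le 1$ we have $|P(x)| \le \frac{r+1}{\sqrt{2}}$.

Proof: $P(x)$ is an arbitrary polynomial of degree $r$ and hence can be expanded using
the Legendre basis as

$$P(x) = \sum_{k=0}^{r} a_k \sqrt{\frac{2k+1}{2}}\, P_k(x).$$

Then by the condition on $P(x)$ we have

$$\int_{-1}^{1} P^2(x)\,dx = a_0^2 + a_1^2 + \cdots + a_r^2 = 1.$$

From Hölder's inequality we have

$$\sum_i a_i b_i \le \sum_i |a_i b_i| \le \sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2},$$

hence $\left(\sum_i a_i b_i\right)^2 \le \sum_i a_i^2 \sum_i b_i^2$. Regarding $\sqrt{\frac{2k+1}{2}}\,P_k(x)$ as a sequence in $k$ we have

$$P^2(x) = \left[\sum_{k=0}^{r} a_k \sqrt{\frac{2k+1}{2}}\,P_k(x)\right]^2 \le \left(\sum_{k=0}^{r} a_k^2\right) \sum_{k=0}^{r} \frac{2k+1}{2} P_k^2(x) \le \sum_{k=0}^{r} \frac{2k+1}{2}$$

because $|P_k(x)| \le 1$ for $|x| \le 1$ (the reason for that is provided below). Finally,

$$\frac{1}{2}\sum_{k=0}^{r} (2k+1) = \frac{1}{2}\bigl(r(r+1) + (r+1)\bigr) = \frac{(r+1)^2}{2},$$

which proves the theorem.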
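The bound of Lemma 5.2 can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper name `sup_of_unit_norm_poly`, the degree $r = 6$ and the random trials are illustrative choices. Polynomials are represented by their coefficients in the orthonormal Legendre basis $\sqrt{(2k+1)/2}\,P_k$, so normalizing the coefficient vector enforces $\int_{-1}^{1} P^2 = 1$. The choice $a_k \propto \sqrt{(2k+1)/2}$ attains the bound at $x = 1$.

```python
import numpy as np
from numpy.polynomial import legendre

def sup_of_unit_norm_poly(coef, grid):
    """Grid maximum of |P| where P = sum_k a_k sqrt((2k+1)/2) P_k and sum a_k^2 = 1."""
    coef = np.asarray(coef, dtype=float)
    coef = coef / np.linalg.norm(coef)           # enforce integral of P^2 equal to 1
    k = np.arange(coef.size)
    leg_coef = coef * np.sqrt((2 * k + 1) / 2.0)  # coefficients w.r.t. standard P_k
    return np.max(np.abs(legendre.legval(grid, leg_coef)))

r = 6
grid = np.linspace(-1.0, 1.0, 4001)
bound = (r + 1) / np.sqrt(2.0)

rng = np.random.default_rng(1)
random_sups = [sup_of_unit_norm_poly(rng.normal(size=r + 1), grid) for _ in range(200)]

# Extremal coefficients: equality in the Cauchy-Schwarz step at x = 1.
extremal = np.sqrt((2 * np.arange(r + 1) + 1) / 2.0)
extremal_sup = sup_of_unit_norm_poly(extremal, grid)

print("bound:", bound)
print("largest random sup:", max(random_sups))
print("extremal sup:", extremal_sup)
```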

Now we show that for the Legendre polynomial $P_n(x)$ we have $|P_n(x)| \le 1$ over
$|x| \le 1$. We first prove that $P_n(x)$ satisfies a recursion equation, then from this we
find the generating function for the sequence $P_n(x)$, in $n$. Denote the coefficient of
$x^n$ of $P_n(x)$ as $k_n$. Then

$$P_n(x) - \frac{k_n}{k_{n-1}}\, x P_{n-1}(x)$$

is an $(n-1)$-polynomial and hence can be expanded as a linear combination of the Legendre
basis,

$$P_n(x) - \frac{k_n}{k_{n-1}}\, x P_{n-1}(x) = \sum_{k=0}^{n-1} c_k P_k(x).$$

Multiply both sides by $(P_i(x), \cdot)$ where $(f, g) = \int_{-1}^{1} f(x) g(x)\,dx$. Note that

$$(x P_{n-1}, P_i) = (x P_i, P_{n-1})$$

and $x P_i(x)$ is an $(i+1)$-polynomial, therefore if $i + 1 < n - 1$ then $(x P_i, P_{n-1}) = 0$.
Also, $(P_n, P_i) = 0$ if $i \le n - 2$. So the LHS is zero for $i < n - 2$. The RHS is

$$c_0 (P_0, P_i) + \cdots + c_{n-3}(P_{n-3}, P_i) + c_{n-2}(P_{n-2}, P_i) + c_{n-1}(P_{n-1}, P_i).$$

If $i = 0$ then the RHS is $c_0 (P_0, P_0)$, hence $c_0 = 0$. Similarly with $i = 1, \ldots, n - 3$. So
$c_0 = c_1 = \cdots = c_{n-3} = 0$. So

$$P_n(x) = \left(\frac{k_n}{k_{n-1}}\, x + c_{n-1}\right) P_{n-1}(x) + c_{n-2} P_{n-2}(x).$$

Now we show that

$$P_n(x) = P_n(-x)(-1)^n.$$

We have

$$\int_{-1}^{1} P_n(-x) P_m(-x)\,dx = \int_{-1}^{1} P_n(x) P_m(x)\,dx = \frac{2}{2n+1}\,\delta_{nm}.$$

So the polynomials $\{P_n(-x)\}$ form an orthogonal basis and hence we can expand

$$P_n(x) = \sum_{k=0}^{n} a_k P_k(-x)$$

since $P_n(x)$ is an $n$-polynomial. Now, $(P_n(x), P_i(-x)) = 0$ when $i < n$, so $a_i = 0$ for
$i < n$. So $P_n(x) = a_n P_n(-x)$. Equate coefficients of $x^n$ and find that $a_n = (-1)^n$.
Hence $P_n(x) = (-1)^n P_n(-x)$.

With this we can find the value for $c_{n-1}$. Clearly, $(-1)^n P_n(-x)$ satisfies the
recurrence equation, hence

$$(-1)^n P_n(-x) = \left(\frac{k_n}{k_{n-1}}\, x + c_{n-1}\right)(-1)^{n-1} P_{n-1}(-x) + c_{n-2}(-1)^{n-2} P_{n-2}(-x).$$

Subtract it from the original recursion (evaluated at $-x$ and multiplied by $(-1)^n$) and we get $c_{n-1} = 0$. Now to find $c_{n-2}$ we
use the fact that $P_n(1) = 1$ which can be seen from the general formula (cf. Szego &
Polya [25])

(where we use the notation $0^0 = 1$). We have,

$$P_n(1) = \frac{k_n}{k_{n-1}} \cdot 1 \cdot P_{n-1}(1) + c_{n-2} P_{n-2}(1)$$

which implies $c_{n-2} = 1 - \frac{k_n}{k_{n-1}}$. Finally, it is easy to show that

$$k_n = \frac{(2n)!}{2^n (n!)^2}, \qquad \text{so} \qquad \frac{k_n}{k_{n-1}} = \frac{2n-1}{n} \quad \text{and} \quad c_{n-2} = -\frac{n-1}{n}.$$

Hence we proved the recurrence

$$P_n(x) = \frac{2n-1}{n}\, x P_{n-1}(x) - \frac{n-1}{n}\, P_{n-2}(x).$$

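Now from the recurrence we can get the generating function of $P_n(x)$,

$$f(w) = \sum_{n \ge 0} P_n(x) w^n.$$

We have

$$n P_n = (2n-1)\, x P_{n-1} - (n-1) P_{n-2}.$$

We multiply both sides by $w^{n-1}$, sum over $n$, and after some manipulation get

$$f'(w)(1 - 2xw + w^2) = (x - w) f(w).$$

This yields

$$f(w) = \frac{1}{\sqrt{1 - 2xw + w^2}}.$$

Let $x = \cos\theta$. Plug into the generating function and get

$$\frac{1}{\sqrt{1 - 2\cos(\theta) w + w^2}} = \frac{1}{\sqrt{1 - e^{i\theta} w}} \cdot \frac{1}{\sqrt{1 - e^{-i\theta} w}}.$$

Using the identity

$$\frac{1}{\sqrt{1 - z}} = \sum_{k \ge 0} \binom{2k}{k} 4^{-k} z^k$$

the right hand side becomes $\sum_{k \ge 0} \binom{2k}{k} 4^{-k} e^{i\theta k} w^k \sum_{j \ge 0} \binom{2j}{j} 4^{-j} e^{-i\theta j} w^j$. Taking
the coefficient of $w^n$ on both sides we have

$$P_n(\cos\theta) = \sum_{k+j=n} \binom{2k}{k}\binom{2j}{j} 4^{-n} e^{i\theta(k-j)}
= \binom{0}{0}\binom{2n}{n} 4^{-n}\, 2\cos(n\theta) + \binom{2}{1}\binom{2(n-1)}{n-1} 4^{-n}\, 2\cos((n-2)\theta) + \cdots + \binom{n}{n/2}\binom{n}{n/2} 4^{-n},$$

where the last term is present when $n$ is even.
All the terms multiplying the cosines are positive and $\cos(\cdot) \le 1$, so therefore $|P_n(\cos\theta)| \le
P_n(1) = 1$, where the last equality we already established above. This proves that
$|P_n(x)| \le 1$. $\blacksquare$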
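As a numerical companion to the argument above, the three-term recurrence can be run directly and compared with a library evaluation. This is a minimal sketch, assuming NumPy; the function name `legendre_by_recurrence` and the range of degrees are illustrative choices. It also records the grid maximum of $|P_n|$ on $[-1, 1]$.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_by_recurrence(n, x):
    """P_n(x) from n P_n = (2n - 1) x P_{n-1} - (n - 1) P_{n-2}, with P_0 = 1, P_1 = x."""
    x = np.asarray(x, dtype=float)
    p_prev = np.ones_like(x)          # P_0
    if n == 0:
        return p_prev
    p = x.copy()                      # P_1
    for m in range(2, n + 1):
        p_prev, p = p, ((2 * m - 1) * x * p - (m - 1) * p_prev) / m
    return p

grid = np.linspace(-1.0, 1.0, 2001)
max_abs = {}
max_dev = {}
for n in range(0, 13):
    ours = legendre_by_recurrence(n, grid)
    ref = legendre.legval(grid, [0.0] * n + [1.0])   # library value of P_n
    max_abs[n] = np.max(np.abs(ours))
    max_dev[n] = np.max(np.abs(ours - ref))

print("largest deviation from library values:", max(max_dev.values()))
print("largest |P_n| on the grid:", max(max_abs.values()))
```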


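Lemma 5.3 Suppose the polynomial

$$K_1(x_1) = \sum_{i=0}^{r-1} a_i x_1^i$$

satisfies

$$\int_{-1}^{1} K_1(x_1)\, x_1^j\,dx_1 = \delta_{j0}$$

for $0 \le j \le r - 1$, where $\delta_{j0}$ is the Kronecker delta. Then for an even integer $r$,

$$a_0 = \frac{1}{2}\left(r\, 2^{-r} \binom{r}{r/2}\right)^2.$$

In particular, as $r \to \infty$ through the even integers,

$$a_0 \sim \frac{r}{\pi}.$$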

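Proof: By Cramer's rule we have

$$a_0 = \frac{2^{r-1}\begin{vmatrix}
\frac{1}{3} & 0 & \frac{1}{5} & \cdots & 0 & \frac{1}{r+1} \\
0 & \frac{1}{5} & 0 & \cdots & \frac{1}{r+1} & 0 \\
\vdots & & & & & \vdots \\
0 & \frac{1}{r+1} & 0 & \cdots & \frac{1}{2r-3} & 0 \\
\frac{1}{r+1} & 0 & \frac{1}{r+3} & \cdots & 0 & \frac{1}{2r-1}
\end{vmatrix}}{2^{r}\begin{vmatrix}
1 & 0 & \frac{1}{3} & 0 & \cdots & \frac{1}{r-1} & 0 \\
0 & \frac{1}{3} & 0 & \frac{1}{5} & \cdots & 0 & \frac{1}{r+1} \\
\vdots & & & & & & \vdots \\
\frac{1}{r-1} & 0 & \frac{1}{r+1} & 0 & \cdots & \frac{1}{2r-3} & 0 \\
0 & \frac{1}{r+1} & 0 & \frac{1}{r+3} & \cdots & 0 & \frac{1}{2r-1}
\end{vmatrix}} \tag{5.2}$$

We will now reduce the determinant of the denominator as follows: number the rows
and columns from 0 to $r - 1$ (with $r$ even). Consider the matrix

$$\begin{bmatrix}
1 & 0 & \frac{1}{3} & 0 & \cdots & \frac{1}{r-1} & 0 \\
0 & \frac{1}{3} & 0 & \frac{1}{5} & \cdots & 0 & \frac{1}{r+1} \\
\frac{1}{3} & 0 & \frac{1}{5} & 0 & \cdots & \frac{1}{r+1} & 0 \\
\vdots & & & & & & \vdots \\
\frac{1}{r-1} & 0 & \frac{1}{r+1} & 0 & \cdots & \frac{1}{2r-3} & 0 \\
0 & \frac{1}{r+1} & 0 & \frac{1}{r+3} & \cdots & 0 & \frac{1}{2r-1}
\end{bmatrix}.$$

Note that row $2j$ can be written as

$$\left[\frac{1}{2j+1} \;\; 0 \;\; \frac{1}{2j+3} \;\; 0 \;\; \cdots \;\; \frac{1}{2j+r-1} \;\; 0\right]$$

while row $2j + 1$ can be written as

$$\left[0 \;\; \frac{1}{2j+3} \;\; 0 \;\; \frac{1}{2j+5} \;\; \cdots \;\; 0 \;\; \frac{1}{2j+r+1}\right].$$

Now multiply all even numbered rows $2j$ for $j = 1, 2, 3, \ldots, (r-2)/2$ by $-1$, whence
the determinant becomes

$$(-1)^{(r-2)/2}\begin{vmatrix}
1 & 0 & \frac{1}{3} & 0 & \cdots & \frac{1}{r-1} & 0 \\
0 & \frac{1}{3} & 0 & \frac{1}{5} & \cdots & 0 & \frac{1}{r+1} \\
-\frac{1}{3} & 0 & -\frac{1}{5} & 0 & \cdots & -\frac{1}{r+1} & 0 \\
\vdots & & & & & & \vdots \\
-\frac{1}{r-1} & 0 & -\frac{1}{r+1} & 0 & \cdots & -\frac{1}{2r-3} & 0 \\
0 & \frac{1}{r+1} & 0 & \frac{1}{r+3} & \cdots & 0 & \frac{1}{2r-1}
\end{vmatrix}.$$

Now add the top row to all the rows that start with a non-zero element, yielding

$$(-1)^{(r-2)/2}\begin{vmatrix}
1 & 0 & \frac{1}{3} & 0 & \cdots & \frac{1}{r-1} & 0 \\
0 & \frac{1}{3} & 0 & \frac{1}{5} & \cdots & 0 & \frac{1}{r+1} \\
\frac{2}{3 \cdot 1} & 0 & \frac{2}{5 \cdot 3} & 0 & \cdots & \frac{2}{(r+1)(r-1)} & 0 \\
\vdots & & & & & & \vdots \\
\frac{r-2}{(r-1) \cdot 1} & 0 & \frac{r-2}{(r+1) \cdot 3} & 0 & \cdots & \frac{r-2}{(2r-3)(r-1)} & 0 \\
0 & \frac{1}{r+1} & 0 & \frac{1}{r+3} & \cdots & 0 & \frac{1}{2r-1}
\end{vmatrix}.$$

First factor out the numerators of alternating rows and get

$$(-1)^{(r-2)/2}(r-2)(r-4)\cdots(4)(2)\begin{vmatrix}
1 & 0 & \frac{1}{3} & 0 & \cdots & \frac{1}{r-1} & 0 \\
0 & \frac{1}{3} & 0 & \frac{1}{5} & \cdots & 0 & \frac{1}{r+1} \\
\frac{1}{3 \cdot 1} & 0 & \frac{1}{5 \cdot 3} & 0 & \cdots & \frac{1}{(r+1)(r-1)} & 0 \\
\vdots & & & & & & \vdots \\
\frac{1}{(r-1) \cdot 1} & 0 & \frac{1}{(r+1) \cdot 3} & 0 & \cdots & \frac{1}{(2r-3)(r-1)} & 0 \\
0 & \frac{1}{r+1} & 0 & \frac{1}{r+3} & \cdots & 0 & \frac{1}{2r-1}
\end{vmatrix}.$$

Then factor the denominators of alternating columns to get

$$(-1)^{(r-2)/2}\,\frac{(r-2)(r-4)\cdots(4)(2)}{(r-1)(r-3)\cdots(3)}\begin{vmatrix}
1 & 0 & 1 & 0 & \cdots & 1 & 0 \\
0 & \frac{1}{3} & 0 & \frac{1}{5} & \cdots & 0 & \frac{1}{r+1} \\
\frac{1}{3} & 0 & \frac{1}{5} & 0 & \cdots & \frac{1}{r+1} & 0 \\
\vdots & & & & & & \vdots \\
\frac{1}{r-1} & 0 & \frac{1}{r+1} & 0 & \cdots & \frac{1}{2r-3} & 0 \\
0 & \frac{1}{r+1} & 0 & \frac{1}{r+3} & \cdots & 0 & \frac{1}{2r-1}
\end{vmatrix}.$$

Now repeat the operation on the columns: start with columns whose top element is
1 (excluding the first column), and multiply them by $(-1)$. The determinant now
becomes

$$(-1)^{r-2}\,\frac{(r-2)(r-4)\cdots(4)(2)}{(r-1)(r-3)\cdots(3)}\begin{vmatrix}
1 & 0 & -1 & 0 & \cdots & -1 & 0 \\
0 & \frac{1}{3} & 0 & \frac{1}{5} & \cdots & 0 & \frac{1}{r+1} \\
\frac{1}{3} & 0 & -\frac{1}{5} & 0 & \cdots & -\frac{1}{r+1} & 0 \\
\vdots & & & & & & \vdots \\
\frac{1}{r-1} & 0 & -\frac{1}{r+1} & 0 & \cdots & -\frac{1}{2r-3} & 0 \\
0 & \frac{1}{r+1} & 0 & \frac{1}{r+3} & \cdots & 0 & \frac{1}{2r-1}
\end{vmatrix}.$$

Since $r$ is even, $(-1)^{r-2} = 1$. Now add the first column to all the columns that
start with a $-1$ to get

$$\frac{(r-2)(r-4)\cdots(4)(2)}{(r-1)(r-3)\cdots(3)}\begin{vmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 \\
0 & \frac{1}{3} & 0 & \frac{1}{5} & \cdots & 0 & \frac{1}{r+1} \\
\frac{1}{3} & 0 & \frac{2}{5 \cdot 3} & 0 & \cdots & \frac{r-2}{(r+1) \cdot 3} & 0 \\
\vdots & & & & & & \vdots \\
\frac{1}{r-1} & 0 & \frac{2}{(r+1)(r-1)} & 0 & \cdots & \frac{r-2}{(2r-3)(r-1)} & 0 \\
0 & \frac{1}{r+1} & 0 & \frac{1}{r+3} & \cdots & 0 & \frac{1}{2r-1}
\end{vmatrix},$$

factor out the numerators from the alternating columns to get

$$\frac{\bigl((r-2)(r-4)\cdots(4)(2)\bigr)^2}{(r-1)(r-3)\cdots(3)}\begin{vmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 \\
0 & \frac{1}{3} & 0 & \frac{1}{5} & \cdots & 0 & \frac{1}{r+1} \\
\frac{1}{3} & 0 & \frac{1}{5 \cdot 3} & 0 & \cdots & \frac{1}{(r+1) \cdot 3} & 0 \\
\vdots & & & & & & \vdots \\
\frac{1}{r-1} & 0 & \frac{1}{(r+1)(r-1)} & 0 & \cdots & \frac{1}{(2r-3)(r-1)} & 0 \\
0 & \frac{1}{r+1} & 0 & \frac{1}{r+3} & \cdots & 0 & \frac{1}{2r-1}
\end{vmatrix}.$$

And finally factor the denominators from the alternating rows to get

$$\left(\frac{(r-2)(r-4)\cdots(4)(2)}{(r-1)(r-3)\cdots(3)}\right)^2\begin{vmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 \\
0 & \frac{1}{3} & 0 & \frac{1}{5} & \cdots & 0 & \frac{1}{r+1} \\
1 & 0 & \frac{1}{5} & 0 & \cdots & \frac{1}{r+1} & 0 \\
\vdots & & & & & & \vdots \\
1 & 0 & \frac{1}{r+1} & 0 & \cdots & \frac{1}{2r-3} & 0 \\
0 & \frac{1}{r+1} & 0 & \frac{1}{r+3} & \cdots & 0 & \frac{1}{2r-1}
\end{vmatrix}.$$

The determinant in the above expression equals the determinant on the numerator
of (5.2). So the denominator of (5.2) is

$$\left(\frac{(r-2)(r-4)\cdots(4)(2)}{(r-1)(r-3)\cdots(3)}\right)^2 \cdot 2^{r}\begin{vmatrix}
\frac{1}{3} & 0 & \frac{1}{5} & \cdots & 0 & \frac{1}{r+1} \\
0 & \frac{1}{5} & 0 & \cdots & \frac{1}{r+1} & 0 \\
\vdots & & & & & \vdots \\
0 & \frac{1}{r+1} & 0 & \cdots & \frac{1}{2r-3} & 0 \\
\frac{1}{r+1} & 0 & \frac{1}{r+3} & \cdots & 0 & \frac{1}{2r-1}
\end{vmatrix}$$

and so $a_0 = \frac{1}{2}\left(\frac{(r-1)(r-3)\cdots(3)}{(r-2)(r-4)\cdots(4)(2)}\right)^2$. Simple manipulations result in the alternative form

$$a_0 = \frac{1}{2}\left(r\, 2^{-r}\binom{r}{r/2}\right)^2.$$

A simple application of Stirling's formula to the central term of the binomial gives

$$2^{-r}\binom{r}{r/2} \sim \frac{\sqrt{2}}{\sqrt{\pi r}}$$

as $r \to \infty$. This completes the proof.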
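The closed form for $a_0$ can be cross-checked against a direct solution of the moment equations. The following is a minimal sketch using only the Python standard library; the helper names `solve_exact`, `a0_by_linear_solve` and `a0_closed_form` are hypothetical. The linear system is solved in exact rational arithmetic, so the comparison with the binomial formula is exact, and the ratio $a_0 / (r/\pi)$ is then evaluated for a large even $r$.

```python
from fractions import Fraction
from math import comb, pi

def solve_exact(A, b):
    """Gauss-Jordan elimination over the rationals; returns x with A x = b."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = next(i for i in range(c, n) if M[i][c] != 0)   # pivot row
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for i in range(n):
            if i != c and M[i][c] != 0:
                f = M[i][c]
                M[i] = [vi - f * vc for vi, vc in zip(M[i], M[c])]
    return [M[i][n] for i in range(n)]

def a0_by_linear_solve(r):
    """a_0 from the r moment equations with entries 2/(i+j+1) for even i+j."""
    A = [[Fraction(2, i + j + 1) if (i + j) % 2 == 0 else Fraction(0)
          for j in range(r)] for i in range(r)]
    b = [Fraction(1)] + [Fraction(0)] * (r - 1)
    return solve_exact(A, b)[0]

def a0_closed_form(r):
    """a_0 = (1/2) (r 2^{-r} C(r, r/2))^2 for even r."""
    return Fraction(1, 2) * Fraction(r * comb(r, r // 2), 2 ** r) ** 2

for r in (2, 4, 6, 8, 10):
    print(r, a0_by_linear_solve(r), a0_closed_form(r))

ratio_large_r = float(a0_closed_form(200)) / (200 / pi)
print("a_0 / (r / pi) at r = 200:", ratio_large_r)
```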


Lemma  5.4  Let  f(x)  be  a  Gaussian  N -dimensional  density,  then 

sup  [HU . ,»<r(*)|  <  C{2ir)~N/2rhr 

X  1  z 

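where $i_1, i_2, \ldots, i_r \in \{1, 2, \ldots, N\}$, and $C$ is some positive constant.
Proof: We need to bound

$$\left|\frac{\partial^{t_1}}{\partial x_1^{t_1}}\frac{\partial^{t_2}}{\partial x_2^{t_2}} \cdots \frac{\partial^{t_N}}{\partial x_N^{t_N}} f(x)\right|$$

where $0 \le t_i \le r$, and $\sum_{i=1}^{N} t_i = r$, and use the convention that $\partial^0/\partial x^0 f(x) = f(x)$.
Without loss of generality suppose the derivatives are taken w.r.t. $x_1, \ldots, x_l$, i.e.,

$$\left|\frac{\partial^{t_1}}{\partial x_1^{t_1}}\frac{\partial^{t_2}}{\partial x_2^{t_2}} \cdots \frac{\partial^{t_l}}{\partial x_l^{t_l}} f(x)\right|,$$

where $1 \le t_i \le r$, $\sum_{i=1}^{l} t_i = r$. Clearly $1 \le l \le r$. We will suppress the $(2\pi)^{-N/2}$ factor for
brevity. This can be written as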

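$$\left|\frac{d^{t_1}}{dx_1^{t_1}} e^{-\frac{1}{2}x_1^2}\; \frac{d^{t_2}}{dx_2^{t_2}} e^{-\frac{1}{2}x_2^2} \cdots \frac{d^{t_l}}{dx_l^{t_l}} e^{-\frac{1}{2}x_l^2}\; e^{-\frac{1}{2}x_{l+1}^2} \cdots e^{-\frac{1}{2}x_N^2}\right|
\le \left|\frac{d^{t_1}}{dx_1^{t_1}} e^{-\frac{1}{2}x_1^2}\right| \left|\frac{d^{t_2}}{dx_2^{t_2}} e^{-\frac{1}{2}x_2^2}\right| \cdots \left|\frac{d^{t_l}}{dx_l^{t_l}} e^{-\frac{1}{2}x_l^2}\right|.$$

We first bound any one of the one dimensional factors, i.e., denoted by

$$\left|\frac{d^{n}}{dx^{n}} e^{-\frac{1}{2}x^2}\right|.$$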

From  the  theory  of  Hermite  polynomial  we  have 


So 


and  therefore 


dn  _ir2 

~  6  2 
dxn 

We now bound the right side. Using the identity (cf. [26, page 102]),

$$H_n(x) = n! \sum_{k=0}^{\lfloor n/2 \rfloor} \frac{(-1)^k (2x)^{n-2k}}{k!\,(n-2k)!},$$

the above becomes

$$\left|\frac{d^{n}}{dx^{n}} e^{-\frac{1}{2}x^2}\right| = \left|\sum_{k=0}^{n/2} (-1)^k \frac{n!}{k!\,(n-2k)!}\, 2^{-k} x^{n-2k} e^{-\frac{1}{2}x^2}\right|$$

where we used $n/2$ instead of $\lfloor n/2 \rfloor$ for simplicity. This is further bounded by

$$\sum_{k=0}^{n/2} \frac{n!}{k!\,(n-2k)!}\, 2^{-k} |x|^{n-2k} e^{-\frac{1}{2}x^2}.$$

Also,  simple  differentiation  shows 


- 2~k  <  2~k  — - 


(n  -  2 Jfe)! 


xae~?x2 


{a\a!2 

<(-)  ,  a  =  1,2,3,... 


Thus 


ile-K  < 

5a:n  (n  - 


/nW2  _  1  / 2ni\,/ 2 


-  (1)”'  E  ±(t) 

v^/  t=n,n-2,...,0  X  C  7 


By  Stirling’s  formula, 


an  _:x2  /nW2  v-  f2ne\i/2 

a?**  <c‘(2)  . ,(-) 


where  Ci  =  l/\/2i r,  and  using  the  convention  that  (1/0)°  =  1.  The  function  (—p) 
increases  monotonicaUy  in  the  range  0  <  y  <  n  since  its  derivative  is  positive  there. 
Thus  the  sum  is  bounded  by  \ (2e)n/2  and  we  have 

£«-*■*  <*(:T>rn  =  ^lw*'n 


where  C2  is  an  absolute  positive  constant.  Finally,  we  have 

.W2^/2...**,/2^/2  <  C3m/V/2. 
axi1  ax*2  ax;1  _  _  t=i 
Now, recall that $1 \le t_i \le r$, and $\sum_{i=1}^{l} t_i = r$. We can use the following bound (cf.
Marcus & Minc [42, page 106])


Simple  differentiation  gives 


hence 


d ll  dt2 

0x['  dx\2 


QU 

w 


/(2tt  )n/2 


C3rr/2er(l/2+l/e)  rr/2gr 

-  (2^2  -  C3(2 


Lemma 5.5 The $N$-dimensional Gaussian mixture with unit covariance matrices and
equal a priori class probabilities has two modes whenever the means $\theta_1$, $\theta_2$ of the class
conditional densities satisfy $|\theta_1 - \theta_2| > 2$. In this case, the modes determine the Bayes
border (hyperplane).


Proof: The modes of the mixture are denoted by $\eta_1$, $\eta_2$. We have

$$f(x) = \frac{1}{2(2\pi)^{N/2}}\, e^{-\frac{1}{2}|x - \theta_1|^2} + \frac{1}{2(2\pi)^{N/2}}\, e^{-\frac{1}{2}|x - \theta_2|^2}.$$

First, translate the frame so that the point whose coordinate vector is $\theta_1$ is at the
origin. Then transform to a new primed-coordinate system, $x' = Qx$, s.t. the coordinates
of the means, $\theta_1'$ and $\theta_2'$, are on the $x_1'$-axis. (We will still denote the point by
$\theta_1'$ although it is the origin.) This is simply a rotation, hence $Q$ is unitary and the
Jacobian equals 1, yielding

$$\begin{aligned}
f(x') &= \frac{1}{2(2\pi)^{N/2}}\, e^{-\frac{1}{2}|Q^T x' - Q^T \theta_1'|^2} + \frac{1}{2(2\pi)^{N/2}}\, e^{-\frac{1}{2}|Q^T x' - Q^T \theta_2'|^2} \\
&= \frac{1}{2(2\pi)^{N/2}}\, e^{-\frac{1}{2}|x' - \theta_1'|^2} + \frac{1}{2(2\pi)^{N/2}}\, e^{-\frac{1}{2}|x' - \theta_2'|^2} \\
&= \frac{1}{2(2\pi)^{N/2}} \left(e^{-\frac{1}{2}(x_1' - \theta_{11}')^2} + e^{-\frac{1}{2}(x_1' - \theta_{21}')^2}\right) e^{-\frac{1}{2}x_2'^2}\, e^{-\frac{1}{2}x_3'^2} \cdots e^{-\frac{1}{2}x_N'^2}.
\end{aligned}$$

Clearly the $x'$ which maximizes $f(x')$ has $x_2' = \cdots = x_N' = 0$. So the modes must be
on the $x_1'$-axis and hence on the line through the means. We differentiate $f(x')$ w.r.t.
$x_1'$ and equate to zero, getting

$$\frac{e^{-\frac{1}{2}(x_1' - \theta_{11}')^2}}{e^{-\frac{1}{2}(x_1' - \theta_{21}')^2}} = \frac{\theta_{21}' - x_1'}{x_1' - \theta_{11}'}.$$

The left side is positive, hence the solutions for $x_1'$ must be between $\theta_{11}'$ and $\theta_{21}'$.
Additional manipulation yields

$$x_1' = -\frac{1}{\theta_{21}' - \theta_{11}'} \log\frac{\theta_{21}' - x_1'}{x_1' - \theta_{11}'} + \frac{\theta_{21}' + \theta_{11}'}{2}$$

and substituting $y = x_1' - \frac{\theta_{21}' + \theta_{11}'}{2}$, $a = \frac{\theta_{21}' - \theta_{11}'}{2}$ gives

$$y = \frac{1}{2a} \log\frac{y + a}{a - y}. \tag{5.3}$$

(Note, $a = |\theta_1 - \theta_2|/2$.) The right side is an odd function around zero. At $y = -a$
and $y = a$ it equals $-\infty$ and $\infty$ respectively. Taking its derivative w.r.t. $y$ yields
$\frac{1}{a^2 - y^2}$, which never equals zero, hence it has no critical points. There are two cases for
the derivative at $y = 0$: (1) $< 1$, which happens when $a^2 > 1$; (2) $\ge 1$, occurring for
$a^2 \le 1$. Case (1) implies that the right side of (5.3) intersects the line (the function
on the left side is a line $y$) only at the two points $y_a$ and $y_b$ (besides 0) which are
equidistant from 0. This implies $f(x')$ has three critical points: at $x_1' = \frac{\theta_{11}' + \theta_{21}'}{2}$ (corresponding
to $y = 0$), and at two points which are equidistant from $\frac{\theta_{11}' + \theta_{21}'}{2}$. The first
is a relative minimum and the other two are the modes, hence the mixture has two
modes. So in case (1) we showed that the modes are equidistant from the average of
the means (which is where the Bayes hyperplane passes), hence the hyperplane passes
through the average of the modes, and the line through the modes is perpendicular to
the Bayes hyperplane. Therefore the modes determine the Bayes hyperplane under
the condition that $|\theta_1 - \theta_2| > 2$. In case (2), there is only one extremum, and it is a
maximum at $x_1' = \frac{\theta_{11}' + \theta_{21}'}{2}$; the mixture has only one mode. The Bayes hyperplane
goes through this point, however it is not possible to determine which of the infinitely
many possible hyperplanes is the Bayes border since the line perpendicular to the
Bayes hyperplane cannot be determined.
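The two cases of the proof can be observed numerically along the line through the means. The following is a minimal sketch using only the Python standard library; the function names `positive_mode_offset` and `axis_density` are hypothetical. It relies on the observation that (5.3) is equivalent to $\tanh(ay) = y/a$ (since $\frac{1}{2a}\log\frac{a+y}{a-y} = \frac{1}{a}\operatorname{artanh}\frac{y}{a}$) and locates the positive root by bisection when $a > 1$.

```python
import math

def positive_mode_offset(a, tol=1e-12):
    """Positive solution y of (5.3), i.e. of tanh(a*y) = y/a; 0.0 when a <= 1."""
    if a <= 1.0:
        return 0.0                       # case (2): the only critical point is y = 0
    g = lambda y: a * math.tanh(a * y) - y
    lo, hi = 0.0, a                      # g > 0 just right of 0 and g(a) < 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def axis_density(y, a):
    """Mixture density along the axis through the means (up to a constant factor)."""
    return math.exp(-0.5 * (y + a) ** 2) + math.exp(-0.5 * (y - a) ** 2)

a = 1.5                                  # |theta_1 - theta_2| = 3 > 2
y_star = positive_mode_offset(a)
# Residual of (5.3) at the computed root.
residual = y_star - math.log((y_star + a) / (a - y_star)) / (2.0 * a)
print("mode offset:", y_star, "residual of (5.3):", residual)
print("density at mode vs. midpoint:", axis_density(y_star, a), axis_density(0.0, a))
print("single-mode case:", positive_mode_offset(0.8))
```

By symmetry the second mode sits at $-y$, so the midpoint of the modes coincides with the midpoint of the means, as the lemma asserts.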


Lemma 5.6 The class of kernels $\mathcal{K}_\sigma$ can be finitely covered.

(We prove this lemma since finite coverability is a necessary condition for Theorem
5.1, in which it is stated as a permissibility condition.)

Proof: In denoting the class $\mathcal{K}_\sigma = \{K_{\sigma,x}(y) : x \in D\}$ and the $N$-dimensional kernel
$K_{\sigma,x}$, we will omit the $\sigma$ since it is the same for all functions in $\mathcal{K}_\sigma$, and $D$ is
a compact subset of $\mathbb{R}^N$. In the following, all vectors, such as $x, y$, are in $\mathbb{R}^N$. We
first show that for a fixed $x$ (a center of a sphere in the covering of $D$), and fixed $\epsilon$,
$\sup_{x_\epsilon} E\,|K_x(y) - K_{x_\epsilon}(y)| \le c\epsilon$, where $x_\epsilon \in \mathbb{R}^N$ is s.t. $|x - x_\epsilon| \le \epsilon$, and $c > 0$ is a
constant. We assume that the distribution of $y$ is absolutely continuous and denote
the pdf by $f(y)$; we also need that $f(y)$ has a probability-1 support containing the
region $\{z : |z - y| \le 1,\ y \in D\}$. In what follows we denote the one dimensional kernel
by $K_{x_i}(y_i) = poly(y_i - x_i)\,1_{|y_i - x_i| \le 1}$, $1 \le i \le N$. We start with:

$$\sup_{x_\epsilon} E\,|K_x(y) - K_{x_\epsilon}(y)|
= \sup_{x_\epsilon} \int \Bigl|\, poly(y_1 - x_{\epsilon 1})\,1_{|y_1 - x_{\epsilon 1}| \le 1}\; poly(y_2 - x_{\epsilon 2})\,1_{|y_2 - x_{\epsilon 2}| \le 1} \cdots poly(y_N - x_{\epsilon N})\,1_{|y_N - x_{\epsilon N}| \le 1}$$
$$\qquad -\; poly(y_1 - x_1)\,1_{|y_1 - x_1| \le 1}\; poly(y_2 - x_2)\,1_{|y_2 - x_2| \le 1} \cdots poly(y_N - x_N)\,1_{|y_N - x_N| \le 1} \Bigr|\, f(y)\,dy.$$

In the above, let $A_1(y, x_\epsilon, x)$ denote the quantity inside the absolute value. We need
to define the following:

$$L^1_{i,x_\epsilon,x} = \{y_i : |y_i - x_{\epsilon i}| > 1,\ |y_i - x_i| \le 1\}$$
$$L^2_{i,x_\epsilon,x} = \{y_i : |y_i - x_{\epsilon i}| \le 1,\ |y_i - x_i| \le 1\}$$
$$L^3_{i,x_\epsilon,x} = \{y_i : |y_i - x_{\epsilon i}| \le 1,\ |y_i - x_i| > 1\}$$

where subscript $i$ denotes the $i$th component of a vector. Continuing we have

$$\sup_{x_\epsilon} \int |A_1(y, x_\epsilon, x)|\, f(y)\,dy
\le \sup_{x_\epsilon} \int |A_1(y, x_\epsilon, x)|\, 1_{\{y_1 \in L^1_{1,x_\epsilon,x}\}} f(y)\,dy + \sup_{x_\epsilon} \int |A_1(y, x_\epsilon, x)|\, 1_{\{y_1 \in L^2_{1,x_\epsilon,x}\}} f(y)\,dy$$
$$\qquad +\; \sup_{x_\epsilon} \int |A_1(y, x_\epsilon, x)|\, 1_{\{y_1 \in L^3_{1,x_\epsilon,x}\}} f(y)\,dy. \tag{5.4}$$


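The first term equals

$$\sup_{x_\epsilon} \int \Bigl|\, poly(y_1 - x_{\epsilon 1})\,1_{|y_1 - x_{\epsilon 1}| \le 1}\,1_{\{y_1 \in L^1_{1,x_\epsilon,x}\}} \prod_{j=2}^{N} poly(y_j - x_{\epsilon j})\,1_{|y_j - x_{\epsilon j}| \le 1}$$
$$\qquad -\; poly(y_1 - x_1)\,1_{|y_1 - x_1| \le 1}\,1_{\{y_1 \in L^1_{1,x_\epsilon,x}\}} \prod_{j=2}^{N} poly(y_j - x_j)\,1_{|y_j - x_j| \le 1} \Bigr|\, f(y)\,dy$$
$$= \sup_{x_\epsilon} \int \Bigl|\, poly(y_1 - x_1)\,1_{\{y_1 \in L^1_{1,x_\epsilon,x}\}} \prod_{j=2}^{N} poly(y_j - x_j)\,1_{|y_j - x_j| \le 1} \Bigr|\, f(y)\,dy$$

since $1_{\{y_1 \in L^1_{1,x_\epsilon,x}\}} 1_{|y_1 - x_{\epsilon 1}| \le 1} = 0$ and $1_{\{y_1 \in L^1_{1,x_\epsilon,x}\}} 1_{|y_1 - x_1| \le 1} = 1_{\{y_1 \in L^1_{1,x_\epsilon,x}\}}$. The above is

$$\le C^{N} \sup_{x_\epsilon} \int_{|y_2 - x_2| \le 1, \ldots, |y_N - x_N| \le 1} f(y_2, \ldots, y_N) \int_{L^1_{1,x_\epsilon,x}} f(y_1 | y_2, \ldots, y_N)\,dy_1\,dy_2 \cdots dy_N$$

where $C$ bounds the one-dimensional kernel over its support. Now the only factor
depending on $x_\epsilon$ is $L^1_{1,x_\epsilon,x}$, which is $\subseteq L^1_{1,\epsilon}$, where

$$L^1_{1,\epsilon} = \{L^1_{1,x_\epsilon,x} : |x_{\epsilon 1} - x_1| = \epsilon\}$$

(recall that $|x - x_\epsilon| \le \epsilon$, hence $|x_{\epsilon 1} - x_1| \le \epsilon$). That is, $L^1_{1,\epsilon}$ is the one interval that
corresponds to $L^1_{1,x_\epsilon,x}$ with the specific $x_\epsilon$ that is $\epsilon$ away from $x_1$. Hence the above is

$$\le C^{N} \int \cdots \int \int_{L^1_{1,\epsilon}} f(y_1 | y_2, \ldots, y_N)\,dy_1\; f(y_2, \ldots, y_N)\,dy_2 \cdots dy_N \le C^{N} M_1 \epsilon$$

where we assume $f(y_1 | y_2, \ldots, y_N) \le M_1$ over the support of the integral, for some
positive constant $M_1$. Note, because of the initial assumption on the probability-1
support of $f(y)$ it follows that $L^1_{1,\epsilon}$ cannot contain all the probability mass and hence
the above follows.

We can similarly bound the third term of (5.4) by $C^{N} M_3 \epsilon$, for some positive
constant $M_3$, using the fact that $1_{\{y_1 \in L^3_{1,x_\epsilon,x}\}} 1_{|y_1 - x_1| \le 1} = 0$ and $1_{\{y_1 \in L^3_{1,x_\epsilon,x}\}} 1_{|y_1 - x_{\epsilon 1}| \le 1} =
1_{\{y_1 \in L^3_{1,x_\epsilon,x}\}}$. We use the fact that $1_{\{y_1 \in L^2_{1,x_\epsilon,x}\}} 1_{|y_1 - x_1| \le 1} = 1_{\{y_1 \in L^2_{1,x_\epsilon,x}\}}$ and $1_{\{y_1 \in L^2_{1,x_\epsilon,x}\}} 1_{|y_1 - x_{\epsilon 1}| \le 1}
= 1_{\{y_1 \in L^2_{1,x_\epsilon,x}\}}$ to get the second term of (5.4)

$$\sup_{x_\epsilon} \int \Bigl|\, poly(y_1 - x_{\epsilon 1})\,1_{\{y_1 \in L^2_{1,x_\epsilon,x}\}} \prod_{j=2}^{N} poly(y_j - x_{\epsilon j})\,1_{|y_j - x_{\epsilon j}| \le 1}
- poly(y_1 - x_1)\,1_{\{y_1 \in L^2_{1,x_\epsilon,x}\}} \prod_{j=2}^{N} poly(y_j - x_j)\,1_{|y_j - x_j| \le 1} \Bigr|\, f(y)\,dy.$$

Denote the quantity in the absolute value as $A_2(y, x_\epsilon, x)$. We break it as

$$\sup_{x_\epsilon} \int |A_2(y, x_\epsilon, x)|\, 1_{\{y_2 \in L^1_{2,x_\epsilon,x}\}} f(y)\,dy
+ \sup_{x_\epsilon} \int |A_2(y, x_\epsilon, x)|\, 1_{\{y_2 \in L^2_{2,x_\epsilon,x}\}} f(y)\,dy + \sup_{x_\epsilon} \int |A_2(y, x_\epsilon, x)|\, 1_{\{y_2 \in L^3_{2,x_\epsilon,x}\}} f(y)\,dy.$$

Using the same ideas as before, we find the first and third terms are bounded from
above by some positive constant multiple of $\epsilon$. Then continue to break the second
term into

$$\sup_{x_\epsilon} \int \Bigl|\, poly(y_1 - x_{\epsilon 1})\,1_{\{y_1 \in L^2_{1,x_\epsilon,x}\}}\, poly(y_2 - x_{\epsilon 2})\,1_{\{y_2 \in L^2_{2,x_\epsilon,x}\}} \prod_{j=3}^{N} poly(y_j - x_{\epsilon j})\,1_{|y_j - x_{\epsilon j}| \le 1}$$
$$\qquad -\; poly(y_1 - x_1)\,1_{\{y_1 \in L^2_{1,x_\epsilon,x}\}}\, poly(y_2 - x_2)\,1_{\{y_2 \in L^2_{2,x_\epsilon,x}\}} \prod_{j=3}^{N} poly(y_j - x_j)\,1_{|y_j - x_j| \le 1} \Bigr|\, f(y)\,dy.$$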
3=3  ‘ 
Denote the quantity in absolute value by $A_3(y, x_\epsilon, x)$ and continue to break as before
until we end up with all terms which are bounded from above by some positive
constant multiple of $\epsilon$, and one term which is as follows:

$$\sup_{x_\epsilon} \int_{y_1 \in L^2_{1,x_\epsilon,x}, \ldots, y_N \in L^2_{N,x_\epsilon,x}} \left|\prod_{i=1}^{N} poly(y_i - x_{\epsilon i}) - \prod_{i=1}^{N} poly(y_i - x_i)\right| f(y)\,dy$$
$$\le \sup_{x_\epsilon}\; \sup_{y_i \in L^2_{i,x_\epsilon,x}} \left|\prod_{i=1}^{N} poly(y_i - x_{\epsilon i}) - \prod_{i=1}^{N} poly(y_i - x_i)\right|$$
$$\le \left|\prod_{i=1}^{N} poly(y_i^* - x_{\epsilon i}^*) - \prod_{i=1}^{N} poly(y_i^* - x_i)\right| \tag{5.5}$$

where $x_\epsilon^*, y^*$ are where the maximum is achieved (it is achieved since each of the
one-dimensional kernel functions, $K_{x_{\epsilon i}}(y_i)$, is bounded over the set on which $(x_\epsilon, y)$
vary). But each of the one-dimensional factors satisfies

$$|poly(y_i^* - x_{\epsilon i}^*) - poly(y_i^* - x_i)| \le M_i' \epsilon$$

where $M_i'$ is some finite constant, because

$$\sup_{x_{\epsilon i}}\; \sup_{y_i \in L^2_{i,x_\epsilon,x}} |poly(y_i - x_{\epsilon i}) - poly(y_i - x_i)| \le \sup_{x_{\epsilon i}}\; \sup_{|y_i - x_i| \le 1} |poly(y_i - x_{\epsilon i}) - poly(y_i - x_i)|$$

since $L^2_{i,x_\epsilon,x} \subseteq \{y_i : |y_i - x_i| \le 1\}$. And therefore the above is bounded by

$$\sup_{x_i - 1 - \epsilon \le s,t \le x_i + 1 + \epsilon,\; |s - t| \le \epsilon} |poly(s) - poly(t)| \le M_i' \epsilon$$

because $|x_{\epsilon i} - x_i| \le \epsilon$, and $poly(\cdot)$ is continuous over the compact set

$$\{s, t : x_i - 1 - \epsilon \le s, t \le x_i + 1 + \epsilon\}.$$

Hence in (5.5), $poly(y_i^* - x_{\epsilon i}^*) = poly(y_i^* - x_i) + c_0\epsilon$ for some constant $c_0$; by
inspection it is clear that (5.5) becomes a constant multiple of $\epsilon$ for some positive
constant.

Hence $\sup_{x_\epsilon} E\,|K_x(y) - K_{x_\epsilon}(y)| \le c\epsilon$, for some positive constant $c$. Now, for any
$K_x(y) \in \mathcal{K}_\sigma$ (hence $x \in D$), there is an $x_k$, a center of a sphere in the covering of $D$,
such that $|x - x_k| \le \epsilon$. Corresponding to this $x_k$, there exists $K_{x_k}(y)$ such that $E\,|K_{x_k}(y) - K_x(y)| \le c\epsilon$
because we showed it is true for any $x_\epsilon$ with $|x_\epsilon - x_k| \le \epsilon$. This implies the existence
of a finite collection of functions $\{K_{x_1}(y), K_{x_2}(y), \ldots, K_{x_{cov(D)}}(y)\}$ which covers $\mathcal{K}_\sigma$
in the $L_1$ norm to an accuracy $c\epsilon$. $\blacksquare$
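The key estimate of the lemma, that shifting the kernel center by $\epsilon$ changes the kernel by $O(\epsilon)$ in expected absolute value, can be illustrated in one dimension. The following is a minimal sketch, assuming NumPy; the choice $r = 4$ (with hard-coded coefficients), the standard normal density for $y$, the center $x = 0$ and the names `k1` and `l1_distance` are all assumptions made for the illustration. If the estimate holds, the ratio of the $L_1$ distance to $\epsilon$ should stay bounded as $\epsilon$ shrinks.

```python
import numpy as np

# Coefficients a_0..a_3 of the one dimensional kernel for r = 4 (assumed choice).
A4 = np.array([9.0 / 8.0, 0.0, -15.0 / 8.0, 0.0])

def k1(u, a=A4):
    """One dimensional kernel: polynomial on [-1, 1], zero outside."""
    return np.where(np.abs(u) <= 1.0, np.polyval(a[::-1], u), 0.0)

def l1_distance(x, x_eps, step=1e-4):
    """Riemann-sum approximation of E|K_x(y) - K_{x_eps}(y)| for y ~ N(0, 1)."""
    y = np.arange(-4.0, 4.0, step)
    pdf = np.exp(-0.5 * y**2) / np.sqrt(2.0 * np.pi)
    return np.sum(np.abs(k1(y - x) - k1(y - x_eps)) * pdf) * step

eps_values = [0.2, 0.1, 0.05, 0.025]
ratios = [l1_distance(0.0, eps) / eps for eps in eps_values]
for eps, ratio in zip(eps_values, ratios):
    print(f"eps = {eps:5.3f}   E|K_x - K_x_eps| / eps = {ratio:.3f}")
```

The contribution of the jump of the kernel at the ends of its support is also of order $\epsilon$, which mirrors the role of the sets $L^1$ and $L^3$ in the proof.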

5.4  Proof  of  Theorem  5.1 

We will use here the notation $\zeta_1, \ldots, \zeta_n$ to represent the randomly drawn unlabeled
sample of size $n$, where $n$ is stated in Theorem 5.1. We denote by $x$ a vector in $\mathbb{R}^N$,
and $x_i$, $1 \le i \le N$, denotes its components.

Initially we show that with this $n$-sample it is possible to estimate $f(x)$ by $f_n(x)$
to within small deviation where the goodness of fit is measured by

$$\frac{\sup_x |f_n(x) - f(x)|}{\sup_x f(x)}$$

where the sup is over $\mathbb{R}^N$. This implies that the mode-estimates are good and hence
the decision rule $h(x)$ is close to the Bayes rule. The reason for this measure of fit is
to enable a comparison of performance, i.e., error versus sample size, across different
dimensions $N$. This choice will ensure that the $O(\epsilon)$ term of $P_{\text{error}}$ in the theorem
is independent of $N$ (it may depend on quantities such as the distance between the
true modes) and the range allowed for $\epsilon$ holds for all $N$; we still say for small $\epsilon > 0$
in order for some approximations to hold, but the choice will be independent of $N$,
and in particular, does not decrease with $N$.

We define a function $K_{\sigma,x} \in \mathcal{K}_\sigma$ as follows: let

$$K_{\sigma,x}(y) = K\left(\frac{x - y}{\sigma}\right)$$

where $y \in \mathbb{R}^N$, and $x$ is in a compact set $D$ in $\mathbb{R}^N$ (to be specified later), and $K$ is
a real valued function chosen as

$$K(y) = K_1(y_1) K_1(y_2) \cdots K_1(y_N)$$

where $K_1(y_1)$ is an $(r-1)$th degree polynomial which is orthogonal to $y_1, y_1^2, \ldots, y_1^{r-1}$
and such that

$$\int K_1(y_1)\,dy_1 = 1.$$

(The subscript in $K_1$ indicates that it is a function on $\mathbb{R}^1$.) We later describe the
reason for this choice and its construction in detail. Define the estimate $f_n(x)$ as the
empirical mean of the function

$$\sigma^{-N} K_{\sigma,x}(\cdot),$$

i.e.,

$$f_n(x) = \frac{1}{n}\sum_{i=1}^{n} \sigma^{-N} K_{\sigma,x}(\zeta_i)$$

where

$$K_{\sigma,x}(y) = K\left(\frac{x - y}{\sigma}\right), \qquad y, x \in \mathbb{R}^N,\ \sigma \in \mathbb{R}.$$

(Note, we use a double subscript for $K_{\sigma,x}$ which indicates it is not the same function
as the one dimensional kernel $K_1$.) We treat $x$ as a constant, acting as the index of
the function in the class $\mathcal{K}_\sigma$, while the only randomness is in the sample $\zeta_1, \zeta_2, \ldots, \zeta_n$.
Clearly $f_n(x)$ is a random variable with expected value $\bar f(x) = E\left(\sigma^{-N} K_{\sigma,x}\right)$. The
bias of the estimate is then

$$\frac{\sup_x |\bar f(x) - f(x)|}{\sup_x f(x)}.$$

We can express the error of the estimate in terms of the bias, i.e.,

$$\frac{\sup_x |f_n(x) - f(x)|}{\sup_x f(x)} \le \frac{\sup_x |\bar f(x) - f(x)|}{\sup_x f(x)} + \frac{\sup_x |f_n(x) - \bar f(x)|}{\sup_x f(x)}.$$

In  the  current  context,  f(x)  is  Gaussian,  hence  sup Xf{x)  =  The  bias  is 

nonrandom and, as we later show, decreases to zero as $\sigma \to 0$. The learner aims
at reducing the kernel-window but not too fast (w.r.t. $n$) because the probability of
the second error component decreases at a rate which becomes worse (i.e., slower) if
$\sigma \to 0$ too fast with $n$. The second component is random and is the deviation of the
empirical mean from the true mean of $K_{\sigma,x}$ which we can bound (after we partition
the domain of the indexing-parameter into $D$ and $D^c$) using the uniform SLLN over
the class $\mathcal{K}_\sigma$ of functions $K_{\sigma,x}$, $x \in D$. We start with the bias component.


5.4.1  The  bias  component 

As  will  be  seen  shortly,  the  bias  term  can  be  made  smaller  by  constructing  a  ker¬ 
nel  Ki(xi)  taking  both  negative  and  positive  values,  and  which  is  orthogonal  to 
xi,  Xi2, . . . ,  Xir_1.  We  first  define  the  one-dimensional  kernel.  Let  r  be  an  even  inte¬ 
ger  and 

TS  (T  \  —  /  0  aix\  lxl|  ^  1) 

|  0  otherwise. 

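The $a_i$ are chosen as the solution of the $r$ equations

$$
\begin{aligned}
a_0 (x_1^0, x_1^0) + a_1 (x_1^0, x_1^1) + \cdots + a_{r-1} (x_1^0, x_1^{r-1}) &= 1 \\
a_0 (x_1^1, x_1^0) + a_1 (x_1^1, x_1^1) + \cdots + a_{r-1} (x_1^1, x_1^{r-1}) &= 0 \\
&\ \ \vdots \\
a_0 (x_1^{r-1}, x_1^0) + a_1 (x_1^{r-1}, x_1^1) + \cdots + a_{r-1} (x_1^{r-1}, x_1^{r-1}) &= 0
\end{aligned}
$$

where

$$(f, g) = \int_{-1}^{1} f(x_1)\, g(x_1)\, dx_1.$$

Remark: $(x_1^i, x_1^j) = 0$ if $i + j$ is odd, and $(x_1^i, x_1^j) = \frac{2}{i+j+1}$ if $i + j$ is even.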

Denote the solution vector by $a$ and the matrix of the inner products by $A$. A solution vector $a$ always exists for any chosen $r$ since the matrix $A$ has a nonvanishing determinant. That is because the quadratic form

$$(b, Ab) = \int_{-1}^{1} \sum_{i,j=0}^{r-1} b_i x_1^i\, b_j x_1^j \, dx_1 = \int_{-1}^{1} \Bigl(\sum_{i=0}^{r-1} b_i x_1^i\Bigr)^2 dx_1 > 0$$

whenever $b \ne 0$. Our one-dimensional kernel function $K_1(x_1)$ satisfies $(K_1(x_1), x_1^i) = 0$ for $1 \le i \le r-1$, and $(K_1(x_1), 1) = 1$. Before we show the effect of $r$ on the bias, we prove an upper bound on the magnitude of this kernel; this will be used later when we require uniform SLLN convergence for this class of kernel functions. We look for an upper bound on $|K_1(x_1)|$. From Szegő & Pólya [25, page 89], for any polynomial $P(x_1)$ of $r$th degree with real coefficients such that

$$\int_{-1}^{1} P(x_1)^2 \, dx_1 = 1,$$

we have

$$|P(x_1)| \le \frac{r+1}{\sqrt{2}}$$

uniformly for all $-1 \le x_1 \le 1$. For the proof see Lemma 5.2 in Section 5.3. In our case the polynomial $K_1(x_1)$ is of degree $r - 1$ and satisfies

$$\int_{-1}^{1} \bigl(K_1(x_1)\bigr)^2 dx_1 = (a, Aa) = (a, [1\ 0\ \ldots\ 0]^T) = a_0.$$

Hence

$$\int_{-1}^{1} \Bigl(\frac{K_1(x_1)}{\sqrt{a_0}}\Bigr)^2 dx_1 = 1$$

so that

$$|K_1(x_1)| \le r \sqrt{\frac{a_0}{2}}$$
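for $|x_1| \le 1$. Now we calculate $a_0$. Without loss of generality, take $r$ to be even. By Cramer's rule we have

$$
a_0 = \frac{\det A^{(0)}}{\det A}, \qquad
A = \begin{bmatrix}
2 & 0 & \cdots & \frac{2}{r-1} & 0 \\
0 & \frac{2}{3} & \cdots & 0 & \frac{2}{r+1} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\frac{2}{r-1} & 0 & \cdots & \frac{2}{2r-3} & 0 \\
0 & \frac{2}{r+1} & \cdots & 0 & \frac{2}{2r-1}
\end{bmatrix}, \qquad (5.6)
$$

where $A^{(0)}$ denotes $A$ with its first column replaced by $[1\ 0\ \ldots\ 0]^T$.

With some manipulations (see Lemma 5.3) and for even $r$, we obtain

$$a_0 = \frac{1}{2} \left( \frac{(r-1)(r-3)\cdots(3)}{(r-2)(r-4)\cdots(4)(2)} \right)^2,$$

whence from Stirling's formula applied to the central term of the binomial, we obtain $a_0 \sim r/\pi$ as $r \to \infty$ through the even integers. Consequently,

$$|K_1(x_1)| \le \frac{r^{3/2}}{\sqrt{2\pi}} \qquad (r \to \infty).$$

With $x = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^N$, from before we have the $N$-dimensional kernel as

$$K(x) = K_1(x_1) K_1(x_2) \cdots K_1(x_N).$$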

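We now show the effect of $r$ on the bias by expressing the bias as follows (all integrals are over $\mathbb{R}^N$ unless explicitly specified):

$$\bar f(x) - f(x) = \int \sigma^{-N} K\Bigl(\frac{y - x}{\sigma}\Bigr) f(y)\, dy - f(x) \int K(y)\, dy$$

since

$$\int K(y)\, dy = \int K_1(y_1)\, dy_1 \int K_1(y_2)\, dy_2 \cdots \int K_1(y_N)\, dy_N = 1.$$

Changing the variable of integration to $z = (y - x)/\sigma$ we obtain

$$\bar f(x) - f(x) = \int K(z) \bigl( f(x + \sigma z) - f(x) \bigr)\, dz.$$

We expand $f(x + \sigma z)$ in a Taylor series around $\sigma = 0$. The bias becomes

$$
\begin{aligned}
\sup_x |\bar f(x) - f(x)|\,(2\pi)^{N/2}
&= (2\pi)^{N/2} \sup_x \Bigl| \int K(z) \Bigl[ f(x) + \sigma \sum_{i_1=1}^{N} z_{i_1} f^{(1)}_{i_1}(x) + \frac{\sigma^2}{2!} \sum_{i_1,i_2=1}^{N} z_{i_1} z_{i_2} f^{(2)}_{i_1,i_2}(x) + \cdots \\
&\qquad\qquad + \frac{\sigma^r}{r!} \sum_{i_1,\ldots,i_r=1}^{N} z_{i_1} z_{i_2} \cdots z_{i_r} f^{(r)}_{i_1,\ldots,i_r}(x + cz) - f(x) \Bigr] dz \Bigr| \\
&\le (2\pi)^{N/2} \sup_x \Bigl| \sigma \sum_{i_1=1}^{N} f^{(1)}_{i_1}(x) \int K(z) z_{i_1}\, dz \Bigr| \\
&\quad + (2\pi)^{N/2} \sup_x \Bigl| \frac{\sigma^2}{2!} \sum_{i_1,i_2=1}^{N} f^{(2)}_{i_1,i_2}(x) \int K(z) z_{i_1} z_{i_2}\, dz \Bigr| + \cdots \\
&\quad + (2\pi)^{N/2} \sup_x \Bigl| \frac{\sigma^r}{r!} \sum_{i_1,\ldots,i_r=1}^{N} \int f^{(r)}_{i_1,\ldots,i_r}(x + cz) K(z) z_{i_1} z_{i_2} \cdots z_{i_r}\, dz \Bigr|
\end{aligned}
$$

where $0 \le c \le \sigma$. Now, by Lemma 5.4 in Section 5.3,

$$|f^{(k)}_{i_1,\ldots,i_k}(x)| \le M_k$$

uniformly over $i_1, i_2, \ldots, i_k \in \{1, 2, \ldots, N\}$ for $k \le r + 1$ where

$$M_k = C (2\pi)^{-N/2} k^{k/2} e^{k} \qquad (5.7)$$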

with $C$ an absolute positive constant. Then by the mean value theorem we have

$$|f^{(r)}(x_1) - f^{(r)}(x_2)| \le M_{r+1} |x_1 - x_2|$$

implying uniform continuity of $f^{(r)}$, hence

$$\forall\, \delta_1 > 0 \ \ \exists\, c(\delta_1) \ \text{such that} \ |f^{(r)}(x + c(\delta_1) z) - f^{(r)}(x)| < \delta_1$$

where we view $f^{(r)}$ as a function of $x$, and $c(\delta_1) z$ as a small deviation. It suffices that $|z\, c(\delta_1)| = \delta_1 / M_{r+1}$. Now $|z| \le 2\sqrt{N}$ as $z$ ranges over $[-1, +1]^N$ only. To achieve a deviation of at most $\delta_1$ it therefore suffices to take $c \le \delta_1 / (2\sqrt{N} M_{r+1})$. With this choice we can write

$$f^{(r)}(x + cz) \le f^{(r)}(x) + 2\sqrt{N} M_{r+1} \sigma \le M_r + \delta_1$$

as $0 \le c \le \sigma$, for a suitably small choice of $\sigma \le \delta_1 / (2\sqrt{N} M_{r+1})$ (from (5.7) we have $M_{r+1} < \infty$). To avoid carrying the nuisance factor of $\delta_1$ around, increase each $M_r$ slightly to include an extra $\delta_1$. The bias is now bounded above by

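$$
\begin{aligned}
\sup_x |\bar f(x) - f(x)|\,(2\pi)^{N/2}
&\le (2\pi)^{N/2} M_1 \sigma \sum_{i_1=1}^{N} \Bigl| \int K(z) z_{i_1}\, dz \Bigr|
 + (2\pi)^{N/2} M_2 \frac{\sigma^2}{2!} \sum_{i_1,i_2=1}^{N} \Bigl| \int K(z) z_{i_1} z_{i_2}\, dz \Bigr| + \cdots \\
&\quad + (2\pi)^{N/2} M_r \frac{\sigma^r}{r!} \sum_{i_1,\ldots,i_r=1}^{N} \Bigl| \int K(z) z_{i_1} z_{i_2} \cdots z_{i_r}\, dz \Bigr|.
\end{aligned}
$$

Using the orthogonality of $K_1(z_1)$ to the first $r - 1$ powers of $z_1$, only the last term survives so that

$$
\begin{aligned}
\sup_x |\bar f(x) - f(x)|\,(2\pi)^{N/2}
&\le (2\pi)^{N/2} M_r \frac{\sigma^r}{r!} \sum_{i_1,\ldots,i_r=1}^{N} \Bigl| \int K(z) z_{i_1} z_{i_2} \cdots z_{i_r}\, dz \Bigr| \\
&= (2\pi)^{N/2} M_r \frac{\sigma^r}{r!} \Bigl( \Bigl| \int K_1(z_1) z_1^r\, dz_1 \Bigr| + \Bigl| \int K_1(z_2) z_2^r\, dz_2 \Bigr| + \cdots + \Bigl| \int K_1(z_N) z_N^r\, dz_N \Bigr| \Bigr) \\
&\le (2\pi)^{N/2} N c_2 M_r \frac{\sigma^r}{r!}
\end{aligned}
$$

where, for any $1 \le i \le N$, $\bigl| \int_{-1}^{1} K_1(z_i) z_i^r\, dz_i \bigr| \le c_2$ with $c_2$ an absolute positive constant, as

$$
\begin{aligned}
\int_{-1}^{1} K_1(z_i) z_i^r\, dz_i &\le \int_{-1}^{1} |K_1(z_i) z_i^r|\, dz_i \\
&\le \sqrt{\int_{-1}^{1} K_1^2(z_i)\, dz_i}\ \sqrt{\int_{-1}^{1} z_i^{2r}\, dz_i} \\
&= \sqrt{a_0}\, \sqrt{\frac{2}{2r+1}} \le c_2
\end{aligned}
$$

from Lemma 5.3 for $a_0$. Now, Lemma 5.4 shows that

$$M_r \le C (2\pi)^{-N/2} r^{r/2} e^{r}.$$

The bias is hence bounded above by

$$\frac{c_3\, r^{r/2} e^{r} \sigma^{r} N}{r!} \qquad (5.8)$$

for some positive constant $c_3$. Using Stirling's formula for $r!$, we see that the latter goes to 0 as $r \to \infty$ given that $\sigma \le 1$. (Note that the condition on $\sigma$ from before translates into the requirement $\sigma \le c\,(2\pi)^{N/2} N^{-1/2} r^{-r/2} e^{-r}$ for a positive constant $c$.)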
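To see how quickly the bound (5.8) decays in $r$, one can evaluate its logarithm. This is only an illustrative sketch: the constant $c_3$ is not specified in the text, so it is set to 1 by assumption, and the function name `log_bias_bound` is hypothetical.

```python
import math

def log_bias_bound(r, sigma, N, c3=1.0):
    """Natural log of c3 * r^(r/2) * e^r * sigma^r * N / r!  (c3 = 1 is an assumption)."""
    return (math.log(c3)
            + 0.5 * r * math.log(r)   # r^(r/2)
            + r                       # e^r
            + r * math.log(sigma)     # sigma^r
            + math.log(N)
            - math.lgamma(r + 1))     # log(r!)

if __name__ == "__main__":
    for r in (10, 20, 40, 100):
        print(r, log_bias_bound(r, sigma=1.0, N=1))
```

Working in the log domain avoids overflow in $r^{r/2}$ and $r!$. Even with $\sigma = 1$ the factorial eventually dominates, which is the content of the Stirling argument above; a smaller $\sigma$ only speeds up the decay.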


5.4.2  The  random  part  of  the  error 


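We now treat the second component of the error, i.e.,

$$\frac{\sup_x |f_n(x) - \bar f(x)|}{\sup_x f(x)}.$$

We partition the domain of $x$, the index parameter, and apply the uniform SLLN over the compact part of the partition. Let $D$ be a compact subset of $\mathbb{R}^N$ to be specified. We have

$$
\begin{aligned}
P\left( \frac{\sup_x |f_n(x) - \bar f(x)|}{\sup_x f} > \epsilon \right)
&\le P\left( \frac{\sup_{x \in D} |f_n(x) - \bar f(x)|}{\sup_x f} + \frac{\sup_{x \in D^c} |f_n(x) - \bar f(x)|}{\sup_x f} > \epsilon \right) \\
&\le P\left( \frac{\sup_{x \in D} |f_n(x) - \bar f(x)|}{\sup_x f} > \epsilon/2 \right) + P\left( \frac{\sup_{x \in D^c} |f_n(x) - \bar f(x)|}{\sup_x f} > \epsilon/2 \right) \qquad (5.9)
\end{aligned}
$$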

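where the first term on the right is ready for application of the uniform SLLN. We first show that the second term is subdominant. We have

$$
\begin{aligned}
P\left( \frac{\sup_{x \in D^c} |f_n(x) - \bar f(x)|}{\sup_x f} > \epsilon/2 \right)
&\le P\left( \frac{\sup_{x \in D^c} |f_n(x)|}{\sup_x f} + \frac{\sup_{x \in D^c} |\bar f(x)|}{\sup_x f} > \epsilon/2 \right) \\
&= P\left( \frac{\sup_{x \in D^c} \bigl| \frac{1}{n} \sum_{i=1}^{n} \sigma^{-N} K_{\sigma,x}(\zeta_i) \bigr|}{\sup_x f} + \frac{\sup_{x \in D^c} \bigl| \int \sigma^{-N} K_{\sigma,x}(y) f(y)\, dy \bigr|}{\sup_x f} > \epsilon/2 \right). \qquad (5.10)
\end{aligned}
$$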

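Now suppress the $\sigma$ to simplify notation, and choose

$$D = \{ x : |x - \theta_{01}| \le A \ \text{or}\ |x - \theta_{02}| \le A \}$$

where $A$ will be specified shortly. (We denote by $\mathrm{poly}(x)$ an $r$th degree polynomial in $x_1, x_2, \ldots, x_N$.) We have

$$
\begin{aligned}
\sup_{x \in D^c} \Bigl| \int K_{\sigma,x}(y) f(y)\, dy \Bigr|
&= \sup_{|x - \theta_{01}| > A,\ |x - \theta_{02}| > A} \Bigl| \int \mathrm{poly}(y - x)\, 1_{|y - x| \le 1}\, f(y)\, dy \Bigr| \\
&\le \sup_{|x - \theta_{01}| > A,\ |x - \theta_{02}| > A} \Bigl| \int \mathrm{poly}(y - x)\, 1_{|y - x| \le 1}\, f_1(y | \theta_{01})\, dy \Bigr| \\
&\quad + \sup_{|x - \theta_{01}| > A,\ |x - \theta_{02}| > A} \Bigl| \int \mathrm{poly}(y - x)\, 1_{|y - x| \le 1}\, f_2(y | \theta_{02})\, dy \Bigr| \\
&\le \sup_{|x - \theta_{01}| > A} \int |\mathrm{poly}(y - x)|\, 1_{|y - x| \le 1}\, f_1(y | \theta_{01})\, dy
 + \sup_{|x - \theta_{02}| > A} \int |\mathrm{poly}(y - x)|\, 1_{|y - x| \le 1}\, f_2(y | \theta_{02})\, dy \\
&\le \sup_{|x - \theta_{01}| > A} \int_{|y - \theta_{01}| > A - 1} |\mathrm{poly}(y - x)|\, 1_{|y - x| \le 1}\, \frac{1}{(2\pi)^{N/2}} e^{-\frac{1}{2} |y - \theta_{01}|^2}\, dy \\
&\quad + \sup_{|x - \theta_{02}| > A} \int_{|y - \theta_{02}| > A - 1} |\mathrm{poly}(y - x)|\, 1_{|y - x| \le 1}\, \frac{1}{(2\pi)^{N/2}} e^{-\frac{1}{2} |y - \theta_{02}|^2}\, dy \\
&\le c_4 \frac{r^{3N/2}}{(2\pi)^{N/2}} \int_{|y - \theta_{01}| > A - 1} \frac{1}{(2\pi)^{N/2}} e^{-\frac{1}{2} |y - \theta_{01}|^2}\, dy
 + c_4 \frac{r^{3N/2}}{(2\pi)^{N/2}} \int_{|y - \theta_{02}| > A - 1} \frac{1}{(2\pi)^{N/2}} e^{-\frac{1}{2} |y - \theta_{02}|^2}\, dy.
\end{aligned}
$$

From page 66 these two integrals are bounded above by $cN/A^4$ for some positive constant $c$. Therefore

$$\frac{\sup_{x \in D^c} \bigl| \int \sigma^{-N} K_{\sigma,x}(y) f(y)\, dy \bigr|}{\sup_x f} \le \frac{c_5\, r^{3N/2} N}{\sigma^N A^4} \equiv \Delta.$$

Applying Markov's inequality to (5.10) we now have

$$
\begin{aligned}
P\left( \frac{\sup_{x \in D^c} \bigl| \frac{1}{n} \sum_{i=1}^{n} \sigma^{-N} K_{\sigma,x}(\zeta_i) \bigr|}{1/(2\pi)^{N/2}} > \frac{\epsilon}{2} - \Delta \right)
&\le P\left( \frac{\frac{1}{n} \sum_{i=1}^{n} \sup_{x \in D^c} \bigl| \sigma^{-N} K_{\sigma,x}(\zeta_i) \bigr|}{1/(2\pi)^{N/2}} > \frac{\epsilon}{2} - \Delta \right) \\
&\le \frac{\frac{1}{n} \sum_{i=1}^{n} \mathbf{E} \sup_{x \in D^c} \bigl| \sigma^{-N} K_{\sigma,x}(\zeta_i) \bigr|}{(\frac{\epsilon}{2} - \Delta)/(2\pi)^{N/2}} \\
&= \frac{\int \sup_{x \in D^c} \bigl| \sigma^{-N} \mathrm{poly}(y - x)\, 1_{|y - x| \le 1} \bigr| f(y)\, dy}{(\frac{\epsilon}{2} - \Delta)/(2\pi)^{N/2}} \\
&\le \frac{\Delta}{\frac{\epsilon}{2} - \Delta} \le \frac{4\Delta}{\epsilon}
\end{aligned}
$$

since for any $\epsilon$, $\sigma$, and $N$ we can choose $A$ large enough to make $\Delta$ suitably small compared to $\epsilon$. We hence obtain

$$P\left( \frac{\sup_{x \in D^c} |f_n(x) - \bar f(x)|}{\sup_x f} > \frac{\epsilon}{2} \right) \le c_6 \Delta = \frac{c_7 N r^{3N/2}}{A^4} = \delta/2$$

where $A$ is chosen accordingly. Hence the second term on the right of (5.9) is $\le \delta/2$. Now we estimate the size of $n$ needed to make the first term of (5.9) less than $\delta/2$.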

The stage is set for an application of Theorem 3.10. The class of functions is the set $\{K_{\sigma,x} : x \in D\}$. These functions are uniformly bounded by choice of compact $D$ (note that $\sigma$ is permitted to decrease with $n$). Lemma 5.6 shows that this class is finitely coverable, which is a condition for its permissibility (needed by the theorem). We begin by bounding $\mathbf{E} K_{\sigma,x}^2$. We have

$$\mathbf{E} K_{\sigma,x}^2(y) = \int f(y) K_{\sigma,x}^2(y)\, dy = \sigma^N \int f(x + \sigma z) K^2(z)\, dz$$

with the change of variable $z = (y - x)/\sigma$. In our case $f$ is a multivariate Gaussian mixture and

$$f \le 1/(2\pi)^{N/2}. \qquad (5.11)$$

The bound hence becomes

$$
\begin{aligned}
\mathbf{E} K_{\sigma,x}^2 &\le \frac{\sigma^N}{(2\pi)^{N/2}} \int K^2(z)\, dz \\
&= \frac{\sigma^N}{(2\pi)^{N/2}} \int_{-1}^{1} K_1^2(z_1)\, dz_1 \int_{-1}^{1} K_1^2(z_2)\, dz_2 \cdots \int_{-1}^{1} K_1^2(z_N)\, dz_N
\end{aligned}
$$

<  (*» (j  (2 -m,»))2)w 

<  IQ*)*')  0" 


We let this be $\delta^2$ in the theorem. In order to satisfy the conditions of the theorem, we need to select $\sigma$ as a function of $n$ such that

$$\frac{\log n / n}{\delta_n^2} \to 0 \qquad (n \to \infty).$$


This  is  satisfied  if 


n  >  l  —  ]  log2 


"  (yfi? \N 


err 


\  ar  ) 


We will see that this condition is trivially met in our subsequent choice of $n$. Now for $M$ in the theorem, we bound the functions in the class as

$$|K_{\sigma,x}(y)| \le \frac{r^{3N/2}}{(2\pi)^{N/2}} = M$$

by again using Lemma 5.3. We thus have


P  I  sup 

\x<ZD 


1  /  Ci  x 


i= 1 

=  P  I  sup  ' 

\x£D 


-E  K 


'Cl-®' 


>  e— 
M 


n 


2  —  1 


>  e<rNr-N^ir- 


=  P  (  SU 

\K  <T,.T 


sup 

£fCa 


1  XX,(C0-  e/c,*(Ci) 


ni= 1 


<  24 


32e  ,  32 e 

- lv  loS  - 


V 


(vT*)  *  (\/l*) 


N 


.  £N-N/2-N 
>  ea  r  ' 7r 

-B«a(\/?^)  /8192 


(5.12) 


The  left  hand  side  is  equivalent  to 


P  I  sup 

\x€£ 


/n(®)  -  /(®)|  >  e®  N 


=  P 


( SUPx6D  1/n  ~  /I  >  er-^/2T-^(27r)iV/2  ] 

v  suPl  /  y 

/ SUP xeD  1/n  ~  f\  ^  ,  (1_\N/2 

V  SUP*  / 


— 

r7r/ 


Redefining $\epsilon$ accordingly and calling the right-hand side of (5.12) $\delta/2$, we obtain

$$P\left( \frac{\sup_{x \in D} |f_n(x) - \bar f(x)|}{\sup_x f(x)} > \epsilon \right) \le \delta/2$$

and hence combining with the previous result we have

$$P\left( \frac{\sup_x |f_n(x) - \bar f(x)|}{\sup_x f(x)} > \epsilon \right) \le \delta,$$

when

$$n \ge \frac{c_8\, r^N (2/\pi)^{N/2}}{\epsilon^2 \sigma^N} \left[ d \log \frac{r^{3N/2}}{\epsilon \sigma^N} + \log \frac{1}{\delta} \right].$$

We can use this bound once we calculate $d = VC(\mathcal{K}_\sigma)$, which is obtained (see Definition 3.5) by noticing that $\mathcal{K}_\sigma$ is a class of functions that are linear combinations of a finite basis of functions (as in Theorem 3.6). It will transpire that we can easily calculate the VC-dimension of the graphs of such functions (Definition 3.4), which by Definition 3.5 is $VC(\mathcal{K}_\sigma)$.

Define $\mathcal{F}$ as the class of graphs of the $N$-dimensional kernels $K_{\sigma,x} \in \mathcal{K}_\sigma$ where from before we have $K_{\sigma,x}(y) = K\bigl(\frac{y - x}{\sigma}\bigr)$. Recall that, by definition, $K$ has compact support $[-1, 1]^N$. Each function in $\mathcal{K}_\sigma$, and hence each graph in $\mathcal{F}$, has the same fixed $\sigma$ but has a different $N$-dimensional vector $x$ which indexes it in the class. The graphs are sets in $\mathbb{R}^N \times \mathbb{R}$ since a function $K_{\sigma,x}(y)$ is a mapping from $\mathbb{R}^N$ to $\mathbb{R}$. By Definition 3.5, $VC(\mathcal{K}_\sigma) = VC(\mathcal{F})$, so our aim is to find $VC(\mathcal{F})$. Replace $x$ with $\theta$ to indicate the parameter indexing a function and let $x$ now denote the argument of the function, i.e., $K_{\sigma,\theta}(x)$. The one-dimensional kernel equals

$$\sum_{i=0}^{r-1} a_i (x_1 - \theta_1)^i\, 1_{|x_1 - \theta_1| \le 1}, \qquad x_1, \theta_1 \in \mathbb{R}.$$

This is an $(r-1)$th-degree polynomial in $x_1$ whose coefficients are comprised of the $\theta_1$ and $a_i$. Hence we write the $N$-dimensional kernel as a product

$$K_{\sigma,\theta}(x) = \mathrm{poly}^{r-1}(x_1)\, \mathrm{poly}^{r-1}(x_2) \cdots \mathrm{poly}^{r-1}(x_N)\, 1_{|x_1 - \theta_1| \le 1}\, 1_{|x_2 - \theta_2| \le 1} \cdots 1_{|x_N - \theta_N| \le 1}$$

where $\mathrm{poly}^{r-1}(\cdot)$ denotes a polynomial of degree $r - 1$ in a single variable. Denote by

$$p(x) = \mathrm{poly}^{r-1}(x_1)\, \mathrm{poly}^{r-1}(x_2) \cdots \mathrm{poly}^{r-1}(x_N).$$

Write $B$ for the bound on the magnitude of the one-dimensional kernel. Use the notation $|x - \theta| \le 1$ to denote the set based on the vector inequality

$$|x_i - \theta_i| \le 1, \qquad 1 \le i \le N;$$

the function $1_{|x - \theta| \le 1}$ then connotes the indicator for the cube (in $\mathbb{R}^N$) of side 2 at $\theta$. Let $\mathcal{G}_\pm$ denote the graphs of the functions $\pm B\, 1_{|x - \theta| \le 1}$ respectively. Also, let

$$\mathcal{P}_+ = \{ (x, y) : 0 \le y \le p(x) \},$$

$$\mathcal{P}_- = \{ (x, y) : p(x) \le y \le 0 \}.$$

The graph of $K_{\sigma,\theta}(x)$ is then represented simply by $(\mathcal{P}_+ \cap \mathcal{G}_+) \cup (\mathcal{P}_- \cap \mathcal{G}_-)$. Our aim now is to express this set by intersections/unions of sets of the form

$$\Bigl\{ (x, y) : \sum_i \alpha_i \phi_i(x, y) \ge 0 \Bigr\}$$

where the sum is finite. Then we can directly apply Theorem 3.6 to find the VC-dimension of $K_{\sigma,\theta}(x)$. We first construct a function

$$h_{p,a}(x, y) = p(x) - ay$$

where $a$ is a real scalar. We have

$$\{ (x, y) : h_{p,1}(x, y) \ge 0 \} \cap \{ (x, y) : y \ge 0 \} = \{ (x, y) : 0 \le y \le p(x) \}$$

and

$$\{ (x, y) : h_{-p,-1}(x, y) \ge 0 \} \cap \{ (x, y) : y \le 0 \} = \{ (x, y) : p(x) \le y \le 0 \}.$$

We can index any function of the form

$$p(x) = \mathrm{poly}^{r-1}(x_1)\, \mathrm{poly}^{r-1}(x_2) \cdots \mathrm{poly}^{r-1}(x_N)$$

using the basis

$$\{1, x_1, x_1^2, \ldots, x_1^{r-1}\} \times \{1, x_2, x_2^2, \ldots, x_2^{r-1}\} \times \cdots \times \{1, x_N, x_N^2, \ldots, x_N^{r-1}\}.$$

This basis has cardinality $r^N$. Hence the function $h_{p,a}(x, y)$ can be expressed as a linear combination of $r^N + 1$ terms. By Theorem 3.6 it follows that the VC dimension, $d$, of the class, $\mathcal{H}$, of sets $\{ (x, y) : h_{p,a}(x, y) \ge 0 \}$ is at most $r^N + 1$. The class, $\mathcal{H}'$, of intersections of such sets with $\{ (x, y) : y \ge 0 \}$ can pick out at most the same number of dichotomies of an $m$-sample as $\mathcal{H}$ does. To see this, note that if $(x^i, y^i)$, $1 \le i \le m$, is any $m$-sample shattered by $\mathcal{H}'$ then necessarily we must have $y^i \ge 0$ for each $i$. Now, take any dichotomy of this sample, say the one achieved by a set $A' = A \cap \{ (x, y) : y \ge 0 \}$ which is an element of the class of sets $\mathcal{H}'$. (Note that the set $A$ is in $\mathcal{H}$.) Clearly, the set $A$ achieves the same dichotomy of the sample. Hence $\mathcal{H}$ must shatter this $m$-sample. Hence $VC(\mathcal{H}) \ge VC(\mathcal{H}')$.

Therefore the VC dimension of the class of sets

$$\{ (x, y) : h_{p,1}(x, y) \ge 0 \} \cap \{ (x, y) : y \ge 0 \} = \{ (x, y) : 0 \le y \le p(x) \}$$

is at most $r^N + 1$ and likewise the VC dimension of the class of sets

$$\{ (x, y) : h_{-p,-1}(x, y) \ge 0 \} \cap \{ (x, y) : y \le 0 \} = \{ (x, y) : p(x) \le y \le 0 \}$$

is at most $r^N + 1$. Continuing, we have by definition,

$$\mathcal{G}_+ = \bigcap_i \bigl\{ (x, y) \in \mathbb{R}^N \times \mathbb{R} : 0 \le y \le B \cdot 1_{|x_i - \theta_i| \le 1} \bigr\}.$$

It suffices hence to estimate the VC dimension of the class of sets

$$\{ (x, y) : 0 \le y \le B \cdot 1_{|x_1 - \theta_1| \le 1} \}.$$

Define

$$a(y) = \begin{cases} 1 & \text{if } 0 \le y \le B, \\ 0 & \text{otherwise.} \end{cases}$$

It is easy to see that

$$\{ (x, y) : 0 \le y \le B \cdot 1_{|x_1 - \theta_1| \le 1} \} = \{ (x, y) : a(y) - (x_1 - \theta_1)^2 \ge 0 \}.$$

The function $a(y) - (x_1 - \theta_1)^2$ is a linear combination of the function basis $\{ x_1^2, x_1, 1, a(y) \}$. Hence by Theorem 3.6 the class of sets $\{ (x, y) : a(y) - (x_1 - \theta_1)^2 \ge 0 \}$, and therefore the class of sets $\{ (x, y) : 0 \le y \le B \cdot 1_{|x_1 - \theta_1| \le 1} \}$, has VC dimension $\le 4$. This is true for every one of the $N$ classes $\{ (x, y) : 0 \le y \le B \cdot 1_{|x_i - \theta_i| \le 1} \}$, $1 \le i \le N$. By Theorem 3.3, the number of dichotomies of an $m$-sample that is picked out by any such class is $\le m^4$. Hence the class of sets $\mathcal{G}_+$ can pick out at most $m^{4N}$ dichotomies of any $m$-sample. It follows that the family of graphs $\mathcal{P}_+ \cap \mathcal{G}_+$ picks out at most $m^{r^N + 4N + 1}$ dichotomies of any $m$-sample. An analogous treatment shows that the number of dichotomies of an $m$-sample picked out by the class of sets $\mathcal{P}_- \cap \mathcal{G}_-$ is at most $m^{r^N + 4N + 1}$.

It follows that the class of sets

$$\mathcal{F} = (\mathcal{P}_+ \cap \mathcal{G}_+) \cup (\mathcal{P}_- \cap \mathcal{G}_-)$$

achieves no more than

$$m^{2r^N + 8N + 2}$$

subsets out of any collection of $m$ points. Denote the exponent by $c = 2r^N + 8N + 2$. By definition, the VC-dimension of $\mathcal{F}$ is bounded above by the largest value of $m$ for which

$$m^c \ge 2^m$$

whence direct computation shows

$$VC(\mathcal{F}) \le 1.37\, c \log_2 c \le 2c \log c = (4r^N + 16N + 4) \log(2r^N + 8N + 2).$$
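The largest $m$ satisfying $m^c \ge 2^m$ is easy to compute directly. The sketch below is illustrative only: the helper name `largest_m` is hypothetical, and the comparison is made against $2c \log_2 c$, assuming the logarithm in the last bound is taken to base 2.

```python
import math

def largest_m(c):
    """Largest integer m >= 2 with m**c >= 2**m, i.e. c * log2(m) >= m (needs c >= 2)."""
    m = 2
    # c * log2(m) - m is concave in m, so once the inequality fails it fails for good.
    while c * math.log2(m + 1) >= m + 1:
        m += 1
    return m

if __name__ == "__main__":
    # Example exponent c = 2 r^N + 8 N + 2 for the hypothetical values r = 2, N = 1.
    r, N = 2, 1
    c = 2 * r**N + 8 * N + 2
    print(c, largest_m(c), 2 * c * math.log2(c))
```

Comparing the inequality in the form $c \log_2 m \ge m$ avoids forming the very large integers $m^c$ and $2^m$.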

Note that $r^N \ge 4N + 1$ when $r \ge 5 + \log N$ for every $N \ge 1$. For this range of $r$ then,

$$VC(\mathcal{F}) \le 8r^N \log(4r^N) \le 32 r^{2N},$$

as $\log x \le x$ for all $x$. We complete the bound on $n$ by using the following bounds on $d$ and $M$,

$$d \le 32 r^{2N},$$

$$M \le \frac{r^{3N/2}}{(2\pi)^{N/2}},$$
and  obtain  that 


og^+-^rlogi 


aNe2 


ea‘ 


(5.13) 


is sufficient in order to have

$$P\left( \frac{\sup_x |f_n(x) - \bar f(x)|}{\sup_x f(x)} > \epsilon \right) \le \delta/4$$

for $\delta$, $\epsilon$ arbitrary positive, and $\sigma$ sufficiently small as before.

Together with the bias we have that with the same $n$,

$$\frac{\sup_x |f_n(x) - f(x)|}{\sup_x f(x)} > \epsilon + \frac{c_3\, r^{r/2} e^{r} \sigma^{r} N}{r!}$$

with probability $\le \delta/4$. To simplify this we replace $\epsilon$ with $\sqrt{\epsilon}/2$ and replace $c_3\, r^{r/2} e^{r} \sigma^{r} N / r!$ by $\sqrt{\epsilon}/2$ (to find $\sigma$), yielding a total error of $\sqrt{\epsilon}$. First look at the first term (ignoring the less significant $\log(\cdot)$ part) in the bound for $n$. We have

1  N!r  r.r.  ,2 N 


,,N 


(2/7r)iV/22 N/rer*rWWr)  (g)  /r  32 r 

(l+N/2r 


(5.14) 


Now  we  are  free  to  choose  r,  so  let  r  =  5  +  log  N .  We  have 


N 


JV/(5+log  N) 


\{5  + log  N)\J 
So  (5.14)  is  bounded  by 


N 


N/(5+logN) 


-  \  95+log7V 


<  jA7(5+log/V)  _  j 


Cl  1  ( 2 2  ) Ar lc*s( 5+los AT)  2^/(5+l°g^) 

^l-fiV/2  log  A" 


(5.15) 




Now  22  <  12,  and  there  exists  a  cx 2  such  that  for  all  N  >  1, 

N/2 


'2JV 
,  *  ) 


'13  \  f  Ntog(5+logN) 
0 


/13N 
~  Cl2  V12> 

Use  similar  arguments  for  the  2^5+log^  to  get  that  (5.14)  is  bounded  by 


Ci2(12)^log(5+logA0  fWiog(5+logV) 

eN/2  log  N  ' 


Now  the  less  significant  log()  part  in  the  first  term  of  (5.13)  is  bounded  by 


^  3  Ar  log  ( 5+log  N )  /3 


^1 3 


12 


It  follows  that  (5.13)  is  bounded  by 

j  3.V  log  (5+log  N)  ^ 

°U  ^ A'/ 2  log  N  1<5S 


Hence 

suPx  l/n(x)  — /(x)l  ^  rz 
- 7T- i - >  V  € 

suPx  /(*) 


(5.16) 


when 

^  3  Af  log(5+log  AT)  j 

"£04  cN/2kfN  tog- 

with probability $\le \delta/4$. We now need to see how this translates into the decision rule error by showing that the mode-estimates will also be close.

Since $f_n(x)$ may have many relative maxima we need to choose two such modes that can be used to estimate closely the modes of $f(x)$. This is achieved by the following procedure, which is based on the more general algorithm K (see Section 5.5). The requirements for this procedure are that $0 < \epsilon < b$, where $b$ is proportional to $|\theta_{01} - \theta_{02}|$, and that $f(x)$ must have two modes (not just one); for this reason we require the means $\theta_{01}$ and $\theta_{02}$ of $f(x)$ to satisfy $|\theta_{01} - \theta_{02}| > 2$ (see Lemma 5.5).

Consider first the case of $x \in \mathbb{R}$. Denote the true modes as $\eta_1$ and $\eta_2$ and without loss of generality let $\eta_2 > \eta_1$. We first describe how the mode estimates are calculated. For simplicity we proceed with the assumption that $\sup_x |f_n(x) - f(x)| \le \epsilon/2$ and later replace that by $\sup_x |f_n(x) - f(x)| / \sup_x f(x) \le \sqrt{\epsilon}$ as in (5.16). For small enough $\epsilon$, the learner determines the maximum of $f_n(x)$ to be, say, closer to $\eta_1$, which puts it under the first hump of $f(x)$; denote it by $\hat\eta_1$ and this will be the estimate for $\eta_1$. Then the learner determines the $x$-coordinates, $x_{1a}$ and $x_{1b}$, of the two points closest to $\hat\eta_1$ where a horizontal line through the point $(\hat\eta_1, f_n(\hat\eta_1) - 8\epsilon)$ cuts $f_n(x)$. Note that because $|f_n(x) - f(x)| \le \epsilon$ it follows that regardless of where $\hat\eta_1$ is, we have $\eta_1 \in (x_{1a}, x_{1b})$; in fact that is the case even if the line is defined with $3\epsilon$ instead of $8\epsilon$. This guarantees that $f(x)$ is decreasing when moving away from $x_{1a}$ or from $x_{1b}$ by a small amount (again under the main assumption of small enough $\epsilon$). Using this, together with the fact that $|f_n(x) - f(x)| \le \epsilon$, we have that for all $x$ such that $x < x_{1a}$ or $x > x_{1b}$ and such that $x$ is closer to $\eta_1$ than to $\eta_2$, $f_n(x) < f_n(\hat\eta_1) - 6\epsilon$. We claim that there exists a point whose $x$-coordinate, $x_2$, is closer to $\eta_2$ than to $\eta_1$ and $f_n(x_2) > f_n(\hat\eta_1) - 6\epsilon$. This can be seen by drawing a line through $(\hat\eta_1, f_n(\hat\eta_1) - 4\epsilon)$ which intersects $f(x)$ on the second hump at a point $(x_2, f(x_2))$. The reason it intersects is that $|f_n(\hat\eta_1) - f(\eta_1)| \le 3\epsilon$, which implies $f_n(\hat\eta_1) - 4\epsilon \le f(\eta_1) - \epsilon < f(\eta_1) = f(\eta_2)$. The former follows from the fact that

$$|f_n(\hat\eta_1) - f(\hat\eta_1)| \le \epsilon$$

and

$$|f(\hat\eta_1) - f(\eta_1)| \le 2\epsilon.$$

The former is trivial. The latter is true since assuming the contrary implies that $f_n(\hat\eta_1) < f_n(\eta_1)$, and this is clearly false since it contradicts the definition of $\hat\eta_1$ as attaining $\sup_x f_n(x)$. Moreover, $f_n(x_2) \ge f(x_2) - \epsilon = f_n(\hat\eta_1) - 5\epsilon > f_n(\hat\eta_1) - 6\epsilon$, which proves the above claim.

So the learner needs only to use the horizontal line through $(\hat\eta_1, f_n(\hat\eta_1) - 8\epsilon)$, determine $x_{1a}$ and $x_{1b}$, look for the maximum of $f_n(x)$ over $x$ such that $x < x_{1a}$ or $x > x_{1b}$, and let it be $\hat\eta_2$. This guarantees that $\hat\eta_2$ is closer to $\eta_2$ than to $\eta_1$, and hence the consistency of $\hat\eta_i$ to $\eta_i$ as $\epsilon \to 0$. For $f(x)$ with $x \in \mathbb{R}^N$ a similar procedure can be applied to find the mode estimates.
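The one-dimensional procedure just described can be sketched on a grid. This is a minimal illustration under stated assumptions, not the dissertation's algorithm K: the estimate is assumed to be given as values `fn` on an increasing grid `x`, `eps` plays the role of the assumed sup-norm error $\epsilon$, and the function name `two_mode_estimates` is hypothetical.

```python
import numpy as np

def two_mode_estimates(x, fn, eps):
    """Return (eta1_hat, eta2_hat) from values fn of a density estimate on grid x."""
    i1 = int(np.argmax(fn))           # global maximum of the estimate: first mode estimate
    level = fn[i1] - 8.0 * eps        # horizontal line at f_n(eta1_hat) - 8 * eps
    ia = i1
    while ia > 0 and fn[ia] >= level:             # walk left to the nearest crossing (x_1a)
        ia -= 1
    ib = i1
    while ib < len(x) - 1 and fn[ib] >= level:    # walk right to the nearest crossing (x_1b)
        ib += 1
    outside = np.ones(len(x), dtype=bool)
    outside[ia:ib + 1] = False        # keep only grid points beyond the two crossings
    if not outside.any():
        raise ValueError("grid does not extend beyond the first hump")
    i2 = int(np.argmax(np.where(outside, fn, -np.inf)))  # second mode estimate
    return x[i1], x[i2]

if __name__ == "__main__":
    # Noise-free stand-in for f_n: an equal-weight mixture of N(-2, 1) and N(2, 1).
    x = np.linspace(-6.0, 6.0, 2401)
    phi = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    f = 0.5 * phi(x + 2.0) + 0.5 * phi(x - 2.0)
    print(sorted(two_mode_estimates(x, f, eps=0.005)))
```

Excluding the whole interval between the two crossings is what prevents a secondary bump on the first hump from being mistaken for the second mode.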

Now we find the possible deviation of $\hat\eta_i$ from $\eta_i$, where $i = 1, 2$. Without loss of generality let the means be $\mu_1 = 0$ and $\mu_2 = \mu$. The modes are $\eta_1$ and $\eta_2$. We look for the largest deviation, $y$, from $\eta_1$ such that it is possible for $f_n(\eta_1 + y) \ge f_n(\eta_1)$. This is the largest deviation of $\hat\eta_1$ from $\eta_1$ and by symmetry also of $\hat\eta_2$ from $\eta_2$. When $\sup_x |f_n(x) - f(x)| / \sup_x f(x) \le \epsilon/2$ then at the point $y + \eta_1$ we have $f(\eta_1) - f(y + \eta_1) = \epsilon / (2\pi)^{N/2}$. Therefore we look for the $y$ such that $f(\eta_1) - f(y + \eta_1) > \epsilon / (2\pi)^{N/2}$, which implies that $\sup_x |f_n(x) - f(x)| / \sup_x f(x) > \epsilon/2$. After some algebra we determine $y$ up to a positive constant $c_{15}$. Then, adhering to the statement of (5.16), we replace $\epsilon/2$ by $\sqrt{\epsilon}$ for the deviation, and get that $|f_n(\cdot) - f(\cdot)| \ge c_{16} \sqrt{\epsilon}$ if the maximum deviation of $\hat\eta_i$ from $\eta_i$, $i = 1, 2$, is $c_{17} \sqrt{\epsilon}$. So from the last paragraph, it follows that $\{ |\hat\eta_1 - \eta_1| > c_{17} \sqrt{\epsilon} \ \text{or}\ |\hat\eta_2 - \eta_2| > c_{17} \sqrt{\epsilon} \}$ has probability at most $\delta/2$.

Lastly, the learner outputs the hyperplane orthogonal to the line between $\hat\eta_1$ and $\hat\eta_2$ as a decision rule. Since the Bayes border is between $\eta_1$ and $\eta_2$, from Section 4.1 it follows that the above deviation yields $P_{error} \le P_{Bayes} + c_{18} (\sqrt{\epsilon})^2 = P_{Bayes} + c_{18} \epsilon$ if the regions across the hyperplane are labeled correctly. Finally, using the same analysis for the labeled sample complexity as in Section 4.3, but using $\sqrt{\epsilon}$ there instead of $\epsilon$ for the deviation of the mode-estimates, we get that with $m = c_{19} \log \frac{1}{\delta}$ labeled examples, and given that both mode-estimates are $c_{17} \sqrt{\epsilon}$-close to the true modes, the probability of choosing a labeling (by the majority rule) with $P_{error} > P_{Bayes} + c_{20} \epsilon$ is at most $\delta/2$. Hence the probability is at most $\delta$ that the learner outputs a decision rule whose $P_{error} > P_{Bayes} + c_{20} \epsilon = P_{Bayes}(1 + c_{21} \epsilon)$. This completes the proof of the theorem. ∎

5.5  Mixture  of  a  General  Form 

In the previous section we established the complexity of learning a decision rule for a problem based on a Gaussian mixture $f$ using a procedure that estimates the modes of $f$. The requirements of this procedure, together with the resulting exponentially large $n$ stated in Theorem 5.1, suggest that the procedure is powerful enough to handle a richer variety of mixtures. Indeed, we already saw in Chapter 4 that the Gaussian mixture can be learned with polynomial-sized samples given that parametric side information is available, so the need for an exponential-sized sample here suggests that the technique can learn problems based on a broader class of mixtures.

In this section, we extend the intuition that modes can determine the Bayes border to a large nonparametric class (containing Gaussian mixtures) where the mixtures are not necessarily identifiable. We define a family, $\mathcal{P}$, of classification problems, each specified by a pair of class conditional densities and having a Bayes border that is identified by the modes of the mixture corresponding to the specific problem. We denote by $\mathcal{F}$ the class of mixtures which is induced by $\mathcal{P}$.

First we will prescribe the general form for the pair of densities that a problem in $\mathcal{P}$ may have, through several conditions. One of the consequences of these conditions is that a sufficient mixed sample complexity for learning a decision rule for any problem in $\mathcal{P}$ is the same as that of Theorem 5.1. It is of no concern whether there are several different problems in $\mathcal{P}$ that are associated with the same mixture (in which case the mixture is not identifiable) because the hyperplane identified by this mixture is the Bayes border for all these problems. We will impose the conditions on the class $\mathcal{P}$ in the course of the discussion, rather than all at once, for better comprehension.

Secondly, we will describe the algorithm K and prove its consistent estimation of the modes of an $f \in \mathcal{F}$.

We now proceed with the description of the problem class $\mathcal{P}$. Let a classification problem in $\mathcal{P}$ be defined as having the following class conditional densities

$$f_{g,\theta_1}(x) \propto g(|x - \theta_1|^2),$$

$$f_{g,\theta_2}(x) \propto g(|x - \theta_2|^2),$$

for $\theta_i \in \mathbb{R}^N$ ($i = 1, 2$) and where $g$ is smooth, decreasing on $[0, \infty)$, bounded above by 1, satisfying

$$|g^{(r)}_{i_1,i_2,\ldots,i_r}(|x|^2)| \le c_1 c_2^r\, r^{r/2}$$

where $i_1, i_2, \ldots, i_r \in \{1, 2, \ldots, N\}$, $c_1$, $c_2$ are positive constants, and the absolute value of the lower partial derivatives of order less than $r$ is bounded by some positive constants uniformly over $x$. (If we use a bound of the form $c_3 r^r$ then the unlabeled sample complexity will differ from the one in Theorem 5.1 only in the constant raised to $N \log\log N$.) An additional condition on $g$ is that its induced mixture $f_{g,\theta} \in \mathcal{F}$, where $\theta = [\theta_1, \theta_2]$, must have at least two modes. (In our discussion below we show that the last condition is easily satisfied by many types of $g$.) For brevity, when there is no danger of misinterpretation, we will drop the subscript $g, \theta$ and refer to a mixture just as $f$.

The pattern classes in $\mathcal{P}$ have a mixture of the following form

$$f_{g,\theta}(x) \propto \frac{1}{2} \bigl( g(|x - \theta_1|^2) + g(|x - \theta_2|^2) \bigr)$$

which may have multiple modes (a mode is a local maximum) although $g(|x - \theta|^2)$ has a single mode at $x = \theta$. Now consider the region

$$\{ x : g(|x - \theta_1|^2) = g(|x - \theta_2|^2) \}$$

which is the Bayes decision border of the problem. This region is equivalent to $\{ x : |x - \theta_1| = |x - \theta_2| \}$ ($g(u)$ has an inverse as it is decreasing), which is a hyperplane passing through the midpoint $\frac{\theta_1 + \theta_2}{2}$ and perpendicular to the line through $\theta_1$ and $\theta_2$. (As a consequence of the succeeding discussion it follows that the points $\theta_1$ and $\theta_2$ are not modes of the mixture $f$.) We now prove that the global modes of $f(x)$ are on a line through $\theta_1$ and $\theta_2$, and that they identify this hyperplane and therefore the Bayes border.

We will show that the set of modes on the line through $\theta_1$ and $\theta_2$ (which includes the global modes) has an average which is exactly the point $\frac{\theta_1 + \theta_2}{2}$. Moreover, if we take only the global modes (assuming that there is more than one), their average is also this point because all the modes on the line appear in symmetric pairs. Hence the global modes of $f(x)$ identify a point (i.e., their average) and a line (going through them). The hyperplane which goes through this point and is perpendicular to this line yields the Bayes decision border.

In the following, we ignore the normalizing constant and the a priori probabilities, which are $\frac{1}{2}$. Translate the coordinate frame so that the origin is at the point $\theta_1$. Then transform to a new primed coordinate system, $x' = Qx$, such that the coordinates of $\theta_1'$ and $\theta_2'$ are on the $x_1'$-axis (the first point is the origin; however we will refer to it by the name $\theta_1'$). This is simply a rotation, hence $Q$ is unitary and the Jacobian equals 1, yielding

$$
\begin{aligned}
f(x') &= g(|Q^T x' - Q^T \theta_1'|^2) + g(|Q^T x' - Q^T \theta_2'|^2) \\
&= g(|x' - \theta_1'|^2) + g(|x' - \theta_2'|^2) \\
&= g\bigl( (x_1' - \theta_{11}')^2 + x_2'^2 + \cdots + x_N'^2 \bigr) + g\bigl( (x_1' - \theta_{21}')^2 + x_2'^2 + \cdots + x_N'^2 \bigr).
\end{aligned}
$$

Note that the global modes of $f(x)$ are on the $x_1'$-axis, since the $x'$ which maximizes $f(x')$ has $x_2'^2 = \cdots = x_N'^2 = 0$ because these are all nonnegative quantities while $g(y)$ decreases as $y$ increases.

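We now show that the midpoint between the modes of $f(x')$ equals $\frac{\theta_1' + \theta_2'}{2}$. Since the global modes are on the $x_1'$-axis, and since at these points all the partial derivatives are 0, in particular the partial derivative with respect to $x_1'$, the solution set of

$$\frac{\partial}{\partial x_1'} \Bigl( g\bigl( (x_1' - \theta_{11}')^2 \bigr) + g\bigl( (x_1' - \theta_{21}')^2 \bigr) \Bigr) = 0$$

must contain the first elements of the global mode vectors (which are the only nonzero elements with respect to the primed frame). We get

$$\frac{g'\bigl( (x_1' - \theta_{11}')^2 \bigr)}{g'\bigl( (x_1' - \theta_{21}')^2 \bigr)} = -\frac{x_1' - \theta_{21}'}{x_1' - \theta_{11}'}$$

where $g'(\cdot)$ denotes the derivative of $g(\cdot)$. For convenience, let $y = x_1' - \frac{\theta_{11}' + \theta_{21}'}{2}$ and $a = \frac{\theta_{21}' - \theta_{11}'}{2}$. The above equation becomes

$$\frac{g'\bigl( (y + a)^2 \bigr)}{g'\bigl( (y - a)^2 \bigr)} = \frac{a - y}{y + a}. \qquad (5.17)$$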

Now since $g$ is decreasing, $g'$ is negative, hence the left side is positive, which
implies that any solution $y$ of the equation satisfies $-a < y < a$. Clearly $y = 0$ is
a solution. Also, suppose $y_0$ is a solution; then it follows that $-y_0$ is also a solution,
since

$$\frac{g'\big((y_0 + a)^2\big)}{g'\big((y_0 - a)^2\big)} = \frac{a - y_0}{y_0 + a}$$

implies that

$$\frac{g'\big((a - y_0)^2\big)}{g'\big((a + y_0)^2\big)} = \frac{a + y_0}{a - y_0}$$

and hence

$$\frac{g'\big((-y_0 + a)^2\big)}{g'\big((-y_0 - a)^2\big)} = \frac{a - (-y_0)}{-y_0 + a}.$$

So the solutions that differ from 0 appear in symmetric pairs. Regardless of the
number of solutions, clearly their average must be $y = 0$. Moreover, considering only
the global modes (which must be on this line), their average is also at $y = 0$,
again due to the symmetry. That is, given that $z$ is a global mode, $-z$ is also a
global mode, since the function $f$ achieves the same value at both of these points, i.e.,

$$
\begin{aligned}
f([z\ 0\ \cdots\ 0]) &= g\big((z + a)^2\big) + g\big((z - a)^2\big) \\
&= g\big((-z + a)^2\big) + g\big((-z - a)^2\big) \\
&= f([-z\ 0\ \cdots\ 0]).
\end{aligned}
$$

So taking the average of the global modes yields the point $x' = \frac{\theta_1' + \theta_2'}{2}$, which is
precisely the point we needed to show.
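The symmetry argument above can be checked numerically. The following is a minimal sketch, assuming the Gaussian-like profile $g(t) = e^{-t/2}$ and the value $a = 2$ (both are illustrative choices, not taken from the text); the helper names `g` and `axis_modes` are hypothetical. It locates the local maxima of $f(y) = g((y+a)^2) + g((y-a)^2)$ on a grid along the axis and averages them.

```python
import numpy as np

def g(t):
    # Assumed profile for illustration only: decreasing on [0, inf).
    return np.exp(-t / 2.0)

def axis_modes(a, half_width=6.0, num=12001):
    """Grid points y at which f(y) = g((y+a)^2) + g((y-a)^2) has a local maximum."""
    y = np.linspace(-half_width, half_width, num)
    f = g((y + a) ** 2) + g((y - a) ** 2)
    # An interior grid point is a peak if it exceeds its left neighbor
    # and is at least as large as its right neighbor.
    is_peak = (f[1:-1] > f[:-2]) & (f[1:-1] >= f[2:])
    return y[1:-1][is_peak]

modes = axis_modes(a=2.0)
print("modes on the axis:", modes)
print("their average:", modes.mean())
```

Up to the grid resolution, the two modes found are mirror images of each other, so their average sits at $y = 0$, i.e., at the midpoint of the two class centers.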

As was shown above, the line through the modes is the line through $\theta_1$ and $\theta_2$.
Hence, given that there is at least one solution pair (i.e., at least two modes), we can
identify this line and choose the hyperplane perpendicular to it that goes through the
point which is the average of the modes; this yields the Bayes border.
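As a small illustration of this construction, the sketch below builds the border from a set of collinear modes and uses it to classify a point. The function names `border_from_modes` and `classify`, and the example mode locations, are hypothetical; the sketch assumes the modes are exact, distinct and collinear.

```python
import numpy as np

def border_from_modes(modes):
    """Return (normal, offset) of the hyperplane {x : normal @ x = offset}.

    The hyperplane passes through the average of the modes and is
    perpendicular to the line through them.
    """
    modes = np.asarray(modes, dtype=float)
    center = modes.mean(axis=0)
    direction = modes[-1] - modes[0]      # any two distinct modes span the line
    normal = direction / np.linalg.norm(direction)
    return normal, float(normal @ center)

def classify(x, normal, offset):
    # Side 1 if x lies beyond the border in the direction of `normal`, else side 0.
    return int(np.asarray(x, dtype=float) @ normal > offset)

# Two illustrative modes placed symmetrically about the point (2, 0, 0).
normal, offset = border_from_modes([[0.1, 0.0, 0.0], [3.9, 0.0, 0.0]])
print(normal, offset)
print(classify([1.0, 5.0, 5.0], normal, offset), classify([3.0, -2.0, 0.0], normal, offset))
```

Which side corresponds to which pattern class is not determined by the mixture alone; in the text this is resolved with a few labeled examples.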

Suppose there exist other two-pattern classes in the same family of problems with
class conditional distributions, say $\hat g(|x - \hat\theta_1|^2)$ and $\hat g(|x - \hat\theta_2|^2)$, with $\hat g \neq g$, $\hat\theta \neq \theta$ and
$\hat\theta \neq [\theta_2, \theta_1]^T$ (the last condition ensuring that we are not considering the simple
permutation, which trivially would yield the same decision regions), such that the mixture is
the same, i.e.,

$$\hat g(|x - \hat\theta_1|^2) + \hat g(|x - \hat\theta_2|^2) = g(|x - \theta_1|^2) + g(|x - \theta_2|^2) = f.$$


Then, arguing as above, we get that the global modes of $f$ are located
on the line through $\theta_1, \theta_2$ and their average is $\frac{\theta_1 + \theta_2}{2}$, and similarly that the global
modes are also on the line through $\hat\theta_1, \hat\theta_2$ and their average is $\frac{\hat\theta_1 + \hat\theta_2}{2}$. Clearly, because
the modes of $f$ are fixed, the two averages must be the same point. Now, if we assume
that $f$ has two global modes, then the above two lines must coincide. So although the
mixture $f$ does not identify a unique class conditional pair, it still identifies a unique
border, which is the Bayes border of both problems, i.e.,

$$
\begin{aligned}
\{x : g(|x - \theta_1|^2) = g(|x - \theta_2|^2)\} &= \{x : |x - \theta_1|^2 = |x - \theta_2|^2\} \\
&= \{x : |x - \hat\theta_1|^2 = |x - \hat\theta_2|^2\} \\
&= \{x : \hat g(|x - \hat\theta_1|^2) = \hat g(|x - \hat\theta_2|^2)\}.
\end{aligned}
$$


Hence this algorithm will yield the Bayes border just from knowledge of $f$, although
it is not possible to uniquely identify the class conditional densities from $f$. As seen
above, the two global mode requirement is also necessary for determining a line to
which the optimal hyperplane should be perpendicular. So in order for the algorithm
to work properly the mixture needs to have at least two global modes.

We now show that there exist functions $g$ which give rise to mixtures $f$ that have
at least two modes. We specify $g'$ by construction, constraining the graph of the ratio
$\frac{g'((y+a)^2)}{g'((y-a)^2)}$ to go through two points $(y_i, A)$ and $(y_j, B)$ satisfying

$$0 < y_i < y_j < a, \qquad \frac{a - y_i}{y_i + a} < A, \qquad \frac{a - y_j}{y_j + a} > B \tag{5.18}$$

where $y_i$, $y_j$, $a$ are any fixed scalars and $A, B > 0$.

This guarantees that the first point lies above the curve $\frac{a - y}{y + a}$ and the second point
below it. Now we choose any continuous, negative function $g'$ such that

$$\frac{g'\big((y_i + a)^2\big)}{g'\big((y_i - a)^2\big)} = A \quad \text{and} \quad \frac{g'\big((y_j + a)^2\big)}{g'\big((y_j - a)^2\big)} = B. \tag{5.19}$$

This guarantees that the curve of $\frac{g'((y+a)^2)}{g'((y-a)^2)}$ intersects the curve of $\frac{a - y}{y + a}$ at least at one
point $(y_k, C)$ where $y_i < y_k < y_j$, i.e., $y_k \neq 0$ is a solution of (5.17). By the above, it
follows that $-y_k$ is also a solution. Let us just show that there exist functions $g'$ that
satisfy (5.19). Fix some values for $A$, $B$ that satisfy (5.18). Arbitrarily pick
negative values for $g'((y_i + a)^2)$ and $g'((y_j + a)^2)$ (negative since $g$ is required to be
decreasing). Then determine the necessary values of $g'((y_i - a)^2)$ and $g'((y_j - a)^2)$.
This completes the specification of $g'$, i.e., specifying the value that $g'$ takes at the
points $y_i - a$, $y_i + a$, $y_j - a$, $y_j + a$. Clearly there are infinitely many such $g'$, hence
functions $g$, that satisfy this, and they need not be indexed by a finite parameter
vector. Similarly we can show that there are functions $g$ with more than two modes.

So there are infinitely many functions $g$ which are decreasing on $[0, \infty)$ and
have continuous derivatives $g'$, such that the corresponding mixtures

$$f_{\theta,g}(x) = c_N \big( g(|x - \theta_1|^2) + g(|x - \theta_2|^2) \big)$$

have at least two modes. Taking the hyperplane which is perpendicular to the line through
these modes and which passes through the point of their average yields the Bayes
optimal decision border.

We now explain the conditions on $f$ (on page 146) which result in the same
sample complexity as for learning the Gaussian mixture of Section 5.4. The only
terms depending on $f(x)$ (as opposed to the kernel function $K(x)$) which influence
the sample complexity are the bound on $f(x)$ (used in (5.11)) and on its $r$th derivative
(5.8). We demand that the first $r - 1$ derivatives be bounded by some finite constants
uniformly over $x$. The bound need only be specified for the $r$th derivative since, by
the orthogonality of $K(x)$ (see page 109), only the $r$th term survives.

In  the  Gaussian  case,  Mr  =  c1(27r)-/v/2r§er  for  some  positive  constant  cx  which 
results  in  the  term  c2eNrN /2  in  (5.14).  It  follows  that  if  g  satisfies 


sup 

X 


q(r) 

Vxn,x 


<  CiAcr2 


where $i_1, i_2, \ldots, i_r \in \{1, 2, \ldots, N\}$, $c_1, c_2$ are positive constants, and the absolute
values of the lower order partial derivatives of $g$ are each bounded by some finite
positive constant, then the sample complexity $n$ is the same as in Theorem 5.1.

The other part where $f$ has any bearing is in the bound on $E K_x^2$ in (5.11). We
just use $c_N$ instead of $(2\pi)^{N/2}$. This does not change the sample complexity order of
growth since it only introduces a different constant in (5.15).

We now discuss algorithm K (stated on page 105), which is used to determine
consistent estimates of the modes of $f \in \mathcal{F}$ and identify a decision rule that has a
$P_{error}$ close to $P_{Bayes}$.

After this subsection, there will be one more condition enforced on the class $\mathcal{V}$,
more precisely, on its induced mixture class $\mathcal{F}$.

5.5.1  Discussion  of  Algorithm  K 

Above, we described the type of mixtures $f \in \mathcal{F}$ for which algorithm K can construct
a decision rule. The intuition behind the algorithm is that, for sufficiently
small $\epsilon > 0$, knowing $f_n(x)$ allows one to determine regions which contain the global
modes $\eta_i$ of $f$ and the mode estimates $\hat\eta_i$. These regions are closely related to the
global humps of the true mixture $f$ (a hump is a region where one global mode is the
only extremum). We now prove the consistency of this algorithm.

We assume that $f$ is of the form stated in the preceding section. In general,
although $g$ is decreasing, $f$ may have multiple global maxima and relative extrema,
also on the $x_1'$-axis. Denote $M = \sup_x f(x)$. We assume that $f$ has at least two global
modes $\eta_1, \eta_2$ such that $f(\eta_i) = M$. (We showed before that there exist $g$ that have
at least two modes; using the same argument we can show there exist $g$ that have
more than two modes.) Moreover, let us assume that there are a finite number of such
global modes, $\eta_i$, $1 \le i \le k$, at which $f(\eta_i) = M$. (This further restricts the type of
$g$ that may be used above, but it is easy to show by the same construction that there
exist such functions $g$.) Let

$$L = \sup_{x \in D} f(x), \qquad D = \{x : f'(x; u_{\eta_i,x}) = 0,\ x \neq \eta_j,\ f(\eta_i) = M,\ f(\eta_j) = M,\ 1 \le i, j \le k\}$$

where $f'(x; u_{\eta_i,x})$ is the directional derivative of $f$ at $x$ in the direction of the unit
vector $u_{\eta_i,x}$, whose direction is the same as the ray starting at $\eta_i$ and going through $x$.
We now show additional constraints on $g$ s.t. $M - L > 0$ and s.t. $f$ decreases
monotonically along any ray starting from a mode, in some small ball around each
mode. These will be crucial for our algorithm.

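We show that there exists a $\delta > 0$ s.t. $f$ is decreasing along any ray in the
direction of $u_{\eta,x}$ where $x$ is in the ball $B(\eta, \delta)$ around a mode $\eta$ (we do not use
subscripts for indexing one of the $k$ modes and instead reserve the subscripts to specify
elements of a vector, except for $\theta_1$ and $\theta_2$). We will use the primed frame (page 147),
which has the points corresponding to $\theta_1$ and $\theta_2$ on the $x_1'$-axis; however, here we drop
the prime from the notation of all vectors. To prove this it suffices to show that
$f'(x; u) < 0$ for any $x \in B(\eta, \delta)$ s.t. $x \neq \eta$ and where $u = \frac{x - \eta}{|x - \eta|}$. By definition,

$$
\begin{aligned}
f'(x; u) &= \frac{\partial f(x)}{\partial x_1} \frac{x_1 - \eta_1}{|x - \eta|} + \frac{\partial f(x)}{\partial x_2} \frac{x_2 - \eta_2}{|x - \eta|} + \cdots + \frac{\partial f(x)}{\partial x_N} \frac{x_N - \eta_N}{|x - \eta|} \\
&= \big( g'(|x - \theta_1|^2)\, 2 (x_1 - \theta_{11}) + g'(|x - \theta_2|^2)\, 2 (x_1 - \theta_{21}) \big) \frac{x_1 - \eta_1}{|x - \eta|} \\
&\quad + \big( g'(|x - \theta_1|^2)\, 2 x_2 + g'(|x - \theta_2|^2)\, 2 x_2 \big) \frac{x_2}{|x - \eta|} \\
&\quad + \cdots + \big( g'(|x - \theta_1|^2)\, 2 x_N + g'(|x - \theta_2|^2)\, 2 x_N \big) \frac{x_N}{|x - \eta|}
\end{aligned}
$$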

where we used the fact that $\theta_1$, $\theta_2$ and $\eta$ are on the $x_1$-axis. We ask whether the
above is $< 0$. Denote $a = g'(|x - \theta_1|^2)$ and $b = g'(|x - \theta_2|^2)$. Then it is the same
as asking if

$$\big( a (x_1 - \theta_{11}) + b (x_1 - \theta_{21}) \big) (x_1 - \eta_1) + (a + b)(x_2^2 + \cdots + x_N^2) < 0.$$

But the second term is negative because $g' < 0$, as $g$ is decreasing on $[0, \infty)$. So it
suffices to check for the values of $x_1$ for which

$$\big( a (x_1 - \theta_{11}) + b (x_1 - \theta_{21}) \big) (x_1 - \eta_1) < 0. \tag{5.20}$$

(We still have $|x - \eta| < \delta$, i.e., $|x_1 - \eta_1| < \delta$, as a constraint on $x_1$.) Without loss of
generality take $\eta_1 < \eta_2$, so we have $\theta_{11} < \eta_1 < \theta_{21}$. There are two cases:

• $x_1 < \eta_1$, in which case we need $|b (x_1 - \theta_{21})| > |a (x_1 - \theta_{11})|$ for (5.20) to hold.

• $x_1 > \eta_1$, where we need $|a (x_1 - \theta_{11})| > |b (x_1 - \theta_{21})|$ for (5.20) to hold.


It  suffices  to  consider  the  two  following  cases: 


X^  -0M 

$21  —271  ’ 


02i-xi  ' 


We  know  from  previous  analysis  that  the  roots  of  are  mo<^es-  But 

by  definition,  7/  =  [771O  •  •  •  0]  is  a  mode.  So  using  the  same  notation  as  in  Section  5.5, 
and  generalizing  for  all  the  k  global  modes,  the  above  two  requirements  are  satisfied 
if  for  1  <  i  <  k,  g  is  chosen  s.t.  for  7 n  —  8  <  y  <  77,  the  function  is  above 

the  graph  of  ^  and  for  i]t  <  y  <  r/t  +  8  it  is  below  this  graph.  We  use  the  same 
construction of $g$ as in (5.18) in order to satisfy these conditions, and hence there are
infinitely many such functions. The existence of $\delta > 0$ follows from having a finite
number of modes. So we showed that there are infinitely many functions $g$ s.t. $f$ is
decreasing along any ray $r_{\eta_i,x}$ for any $x \in B(\eta_i, \delta)$, $1 \le i \le k$, for some $\delta > 0$. We
now use this to construct an algorithm for estimating the modes.

We estimate these modes by $\hat\eta_i$, $1 \le i \le k$, using $f_n(x)$, the estimate of $f(x)$,
where $\sup_x |f(x) - f_n(x)| \le \epsilon$. (Note that we do not need $f_n(x)$ to be continuous in this
algorithm; this is crucial since the kernel estimate gives a discontinuous $f_n(x)$, as the
window functions, i.e., the polynomials, are truncated at $\pm 1$.) For the algorithm to
work  we  need  the  error  accuracy  of  the  kernel  estimate  e  <  M~L . 

First, find $\hat\eta_1 = \operatorname{argsup}_{x \in X} f_n(x)$ and suppose, w.l.o.g., that $|\hat\eta_1 - \eta_1| \le |\hat\eta_1 - \eta_i|$,
$i \neq 1$. (If there is more than one such point, then choose any one.) Define $H_i$ to be
the region

$$H_i = \{x : |x - \eta_i| < |x - \eta_j|,\ j \neq i,\ f \text{ decreases on the line } l_{\eta_i,x} \text{ in the direction of } x,\ x \in \mathbb{R}^N\} \cup \{\eta_i\}$$

and let $H = H_1 \cup H_2 \cup \cdots \cup H_k$, where $H_i \cap H_j = \emptyset$.

We have

$$f(x) > L \implies x \in H.$$

To see this, suppose $x \notin H$. That means that as we walk along a ray from some $\eta_i$ towards
$x$ we encounter a point $z$ at which $f'(z; u_{\eta_i,z}) = 0$. Moreover, there exists such a $z$
satisfying $f(z) \ge f(x)$. Now $z \in D$ (where $D$ is the region in the definition of $L$).
Hence $f(z) \le L$, and therefore $f(x) \le L$, which proves it.

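Now, we have $\hat\eta_1 \in H_1$. This follows since

$$f_n(\hat\eta_1) \ge f_n(\eta_1) \ge f(\eta_1) - \epsilon = M - \epsilon$$

where the first inequality holds because $\hat\eta_1$ is the argsup of $f_n(\cdot)$ over $X \ni \eta_1$. The second
inequality follows from the fact that for any $x$, $|f_n(x) - f(x)| \le \epsilon$. So,

$$f(\hat\eta_1) \ge f_n(\hat\eta_1) - \epsilon \ge M - 2\epsilon > L$$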

where  the  last  inequality  follows  from  the  restriction  on  e,  i.e.,  e  <  — ^ . 

Now define

$$B_\epsilon = \{x : f_n(x) > f_n(\hat\eta_1) - 4\epsilon\}.$$

(We will not carry the subscript $\epsilon$, for brevity.) Clearly $x \in B \implies x \in H$, i.e.,

$$B \subset H$$

since

$$x \in B \implies f_n(x) > f_n(\hat\eta_1) - 4\epsilon \ge M - \epsilon - 4\epsilon = M - 5\epsilon,$$

$$f(x) \ge f_n(x) - \epsilon > M - 6\epsilon > L,$$

and from above, $f(x) > L \implies x \in H$. We will concentrate only on subsets of $B$; hence
throughout the discussion below, $f$ has the ray-decreasing property of $H$.

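We next define a region $A_i$ which depends on the estimate $\hat\eta_i$. Up to now we have only
defined $\hat\eta_1$, so at this stage only $A_1$ can be defined. However, later, when we define
the rest of the estimates $\hat\eta_i$, $i = 2, \ldots, k$, we will use the following definition for $A_i$:

$$A_{\epsilon,i} = \{y : |y - \hat\eta_i| < \inf_{x} \inf_{z \in r_{\hat\eta_i,x}} |z - \hat\eta_i|,\ f_n(z) \le f_n(\hat\eta_1) - 6\epsilon\} \cup \{\hat\eta_i\}$$

where $r_{\hat\eta_i,x}$ is a ray from $\hat\eta_i$ going through $x$. (We omit the $\epsilon$ subscript for brevity.)
So $A_i$ is simply a ball around $\hat\eta_i$ with the above-specified radius.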

We claim that

$$(B - A_1) \cap H_1 = \emptyset.$$

We proceed by showing that all points $x$ in $H_1$ that have $f_n(x) > f_n(\hat\eta_1) - 4\epsilon$ must
also be in $A_1$. First, we prove that

$$\eta_1 \in A_1.$$

Suppose the contrary. Then there exists a $z \in \partial A_1$ on the ray $r_{\hat\eta_1,\eta_1}$ and, by definition
of $A_1$, $f_n(z) \le f_n(\hat\eta_1) - 6\epsilon$. Also, $f$ is decreasing between $\eta_1$ and $\hat\eta_1$ (since we showed
that $\hat\eta_1 \in H_1$, and by definition $\eta_1 \in H_1$), hence

$$f(\eta_1) \ge f(z) \ge f(\hat\eta_1) \implies f_n(z) \ge f(z) - \epsilon \ge f(\hat\eta_1) - \epsilon \ge f_n(\hat\eta_1) - 2\epsilon.$$

This is a contradiction. Now we prove the claim, making use of the fact that $\eta_1 \in A_1$.
Suppose the contrary. Then there exists an $x$ satisfying $x \in B$, $x \notin A_1$, and $x \in H_1$.
This implies there exists some $z \in \partial A_1$, i.e., on the border of $A_1$ as in the definition of
$A_1$, such that $z$ lies in between $\hat\eta_1$ and $x$, i.e., on the ray $r_{\hat\eta_1,x}$. Now, $|z - \hat\eta_1| < |\hat\eta_1 - x|$.
Since $x, \hat\eta_1 \in H_1$, then $f(z) \ge f(x)$. Also, since $z \in \partial A_1$, then $f_n(z) \le f_n(\hat\eta_1) - 6\epsilon$.
And $f_n(x) \le f_n(z) + 2\epsilon$ because at any point $f_n$ can jump by at most $2\epsilon$ from its
current value (due to $|f_n(\cdot) - f(\cdot)| \le \epsilon$). Hence $f_n(x) \le f_n(\hat\eta_1) - 6\epsilon + 2\epsilon = f_n(\hat\eta_1) - 4\epsilon$.
So $x$ does not satisfy $f_n(x) > f_n(\hat\eta_1) - 4\epsilon$. So $x \notin B$. This is a contradiction. Hence
after removing $A_1$ from $B$, we are left with no points from $H_1$, i.e., $(B - A_1) \cap H_1 = \emptyset$.

Now we show that $H_i$, $i \neq 1$, has points that are in $B$, i.e., after the removal of
$A_1$, we still have

$$(B - A_1) \cap H_i \neq \emptyset.$$

First we show that

$$\eta_i \notin A_1, \quad \text{for } 2 \le i \le k.$$

We have three key points, $\hat\eta_1$, $\eta_1$ and $\eta_i$. Consider the line $l_{\eta_1,\eta_i}$. Clearly there exists a
$y \in l_{\eta_1,\eta_i}$ s.t. $f(y) \le L$. That is because, in general, on a ray through any two modes
$\eta_i, \eta_j$ of $f$, there must be a point $y$ at which the directional derivative $f'(y; u_{\eta_i,\eta_j}) = 0$,
and also recall the definition of $L$. (Note that $y \neq \eta_i$ since the two modes differ, i.e.,
$\eta_1 \neq \eta_i$.) So therefore

$$f_n(y) \le L + \epsilon < M - 7\epsilon = f(\eta_1) - 7\epsilon \le f_n(\eta_1) - 6\epsilon \le f_n(\hat\eta_1) - 6\epsilon \implies f_n(y) < f_n(\hat\eta_1) - 6\epsilon$$

where the first inequality from the left follows from the condition on $\epsilon$. Hence either
$y \notin A_1$ or $y \in \partial A_1$. Therefore the radius of $A_1$ is $\le |\hat\eta_1 - y|$ by definition of $A_1$. Also
$|\hat\eta_1 - \eta_1| \le |\hat\eta_1 - \eta_i|$ by definition of $\hat\eta_1$. So we have a circle centered at $\hat\eta_1$ with $\eta_i$ on
the circle and $\eta_1$, $y$ both inside the circle, both lying on a line through $\eta_i$. It is now
simple to see that $|\hat\eta_1 - y| < |\hat\eta_1 - \eta_i|$ by taking a radius of size $|\hat\eta_1 - \eta_i|$ and rotating
it until it goes through the point $y$. So the radius of $A_1$ is $< |\hat\eta_1 - \eta_i|$. Hence $\eta_i \notin A_1$.

Now we prove that there exists at least one point in $(B - A_1) \cap H_i$, namely $\eta_i$
itself, for $2 \le i \le k$. It suffices to check that $f_n(\eta_i) > f_n(\hat\eta_1) - 4\epsilon$, i.e., the definition of
$B$, since we have already proved that $\eta_i \notin A_1$ and, by definition of $H_i$, $\eta_i \in H_i$. But this
follows trivially as

$$f_n(\eta_i) \ge f(\eta_i) - \epsilon = f(\eta_1) - \epsilon \ge f(\hat\eta_1) - \epsilon \ge f_n(\hat\eta_1) - 2\epsilon > f_n(\hat\eta_1) - 4\epsilon.$$

We now define the rest of the estimates, namely $\hat\eta_2, \ldots, \hat\eta_k$. Find the point that
equals $\operatorname{argsup}_{x \in B - A_1} f_n(x)$. This must yield a point which is not in $H_1$, since we have
shown above that $(B - A_1) \cap H_1 = \emptyset$. This point must be in $B \cap H_i$ for some
$2 \le i \le k$, since we proved that there exists at least one point, namely $\eta_i$, in $B \cap H_i$
for $2 \le i \le k$, and since $B \subset H$. W.l.o.g. suppose this point falls closer to $\eta_2$ than
to any other $\eta_i$, $3 \le i \le k$. We define this point as $\hat\eta_2$. From these last statements
and from the definition of $H_2$, we have $\hat\eta_2 \in H_2$. We can then define $A_2$ as was done
above for the general $A_i$. There is a slight asymmetry in the way we defined the
estimates, since we used $\hat\eta_1$ as the pilot in the definition of all $A_i$, $1 \le i \le k$, and in
the definition of the region $B$. Hence we will go through the proofs once more, to
show that they still work when obtaining $\hat\eta_i$, $2 \le i \le k$.

As was the case for $A_1$, here too we claim

$$((B - A_1) - A_2) \cap H_2 = \emptyset$$

as we now show. First we prove that

$$\eta_2 \in A_2.$$

Suppose the contrary. Then there exists a $z \in \partial A_2$ on the ray $r_{\hat\eta_2,\eta_2}$ and, by definition
of $A_2$, $f_n(z) \le f_n(\hat\eta_1) - 6\epsilon$. Also, $f$ is decreasing between $\eta_2$ and $\hat\eta_2$ (since we showed
that $\hat\eta_2 \in H_2$, and by definition $\eta_2 \in H_2$), hence

$$f(\eta_2) \ge f(z) \ge f(\hat\eta_2) \implies f_n(z) \ge f(z) - \epsilon \ge f(\hat\eta_2) - \epsilon \ge f_n(\hat\eta_2) - 2\epsilon \ge f_n(\hat\eta_1) - 4\epsilon > f_n(\hat\eta_1) - 6\epsilon.$$

This is a contradiction. (Note that the inequality before last follows since

$$|f_n(\hat\eta_1) - f_n(\hat\eta_2)| \le 2\epsilon.$$

In fact, for any $1 \le i \neq j \le k$, $|f_n(\hat\eta_i) - f_n(\hat\eta_j)| \le 2\epsilon$ because for any $1 \le i \le k$,

$$f_n(\hat\eta_i) \ge f_n(\eta_i) \ge f(\eta_i) - \epsilon = M - \epsilon$$

and $f_n(\hat\eta_i) \le M + \epsilon$.) Now we prove the claim, making use of the fact that $\eta_2 \in A_2$.
Suppose the contrary, i.e., that $x \in B$ and $x \in H_2$ but $x \notin A_2$. This implies there
exists some $z \in \partial A_2$, i.e., on the border of $A_2$ as in the definition of $A_2$, such that
$z$ lies in between $\hat\eta_2$ and $x$, i.e., on the ray $r_{\hat\eta_2,x}$. Now, $|z - \hat\eta_2| < |\hat\eta_2 - x|$. Also,
since $z \in \partial A_2$, then $f_n(z) \le f_n(\hat\eta_1) - 6\epsilon$. Since $x, \hat\eta_2 \in H_2$, then $f(z) \ge f(x)$. And
$f_n(x) \le f_n(z) + 2\epsilon$ because at any point $f_n$ can jump by at most $2\epsilon$ from its current
value (due to $|f_n(\cdot) - f(\cdot)| \le \epsilon$). Hence $f_n(x) \le f_n(\hat\eta_1) - 6\epsilon + 2\epsilon = f_n(\hat\eta_1) - 4\epsilon$. So $x$
does not satisfy $f_n(x) > f_n(\hat\eta_1) - 4\epsilon$. So $x \notin B$. This is a contradiction. Hence after
removing $A_2$ from $B$, we are left with no points from $H_2$, i.e., $((B - A_1) - A_2) \cap H_2 = \emptyset$.
As before, after removal of $A_2$ from $B - A_1$ we still have points in $H_i$, $3 \le i \le k$,
which are in $B$, i.e.


$$((B - A_1) - A_2) \cap H_i \neq \emptyset \quad \text{for } 3 \le i \le k.$$

First we show that $\eta_i \notin A_2$, for $3 \le i \le k$. We have three key points, $\hat\eta_2$, $\eta_2$ and $\eta_i$.
Consider the line $l_{\eta_2,\eta_i}$. Clearly there exists a $y \in l_{\eta_2,\eta_i}$ s.t. $f(y) \le L$. That is because,
in general, on a ray through any two modes $\eta_i, \eta_j$ of $f$, there must be a point $y$ at
which the directional derivative $f'(y; u_{\eta_i,\eta_j}) = 0$, and also recall the definition of $L$.
(Note that $y \neq \eta_i$ since the two modes differ, i.e., $\eta_2 \neq \eta_i$.) So therefore

$$f_n(y) \le L + \epsilon < M - 7\epsilon = f(\eta_1) - 7\epsilon \le f_n(\eta_1) - 6\epsilon \le f_n(\hat\eta_1) - 6\epsilon \implies f_n(y) < f_n(\hat\eta_1) - 6\epsilon$$

where the first inequality from the left follows from the condition on $\epsilon$. Hence either
$y \notin A_2$ or $y \in \partial A_2$. Therefore the radius of $A_2$ is $\le |\hat\eta_2 - y|$ by definition of $A_2$.
Also $|\hat\eta_2 - \eta_2| \le |\hat\eta_2 - \eta_i|$ by definition of $\hat\eta_2$. So we have a circle centered at $\hat\eta_2$ with
$\eta_i$ on the circle and $\eta_2$, $y$ inside the circle, both lying on a line through $\eta_i$. It is now
simple to see that $|\hat\eta_2 - y| < |\hat\eta_2 - \eta_i|$ by taking a radius of size $|\hat\eta_2 - \eta_i|$ and rotating
it until it goes through the point $y$. So the radius of $A_2$ is $< |\hat\eta_2 - \eta_i|$. Hence $\eta_i \notin A_2$.

Now we prove that there exists at least one point in $((B - A_1) - A_2) \cap H_i$, namely
$\eta_i$ itself, for $3 \le i \le k$. It suffices to check that $f_n(\eta_i) > f_n(\hat\eta_1) - 4\epsilon$, i.e., the definition
of being in $B$, since we already proved that $\eta_i \notin A_2$ and, by definition of $H_i$, $\eta_i \in H_i$.
We have

$$f_n(\eta_i) \ge f(\eta_i) - \epsilon = f(\eta_1) - \epsilon \ge f(\hat\eta_1) - \epsilon \ge f_n(\hat\eta_1) - 2\epsilon > f_n(\hat\eta_1) - 4\epsilon.$$

Using the above, if we take $\operatorname{argsup}_{x \in B - A_1 - A_2} f_n(x)$, this must yield a point which
is not in $H_1$, nor in $H_2$. This point must lie in $B \cap H_i$ for some $3 \le i \le k$. W.l.o.g.
suppose the point falls closer to $\eta_3$ than to any other $\eta_i$, $4 \le i \le k$. We define this
point to be $\hat\eta_3$. From these last statements and from the definition of $H_3$, we have
$\hat\eta_3 \in H_3$.

So it is clear that our procedure for finding $\hat\eta_i$, $1 \le i \le k$, continues as above until
all $\hat\eta_i$ have been found, the last one being $\hat\eta_k = \operatorname{argsup}_{x \in B - A_1 - \cdots - A_{k-1}} f_n(x)$. After that
stage, we have removed $A_i$, $1 \le i \le k$, from $B$, i.e., we are left with the region

$$B \cap A_1^c \cap \cdots \cap A_k^c$$

(where intersection with the complement is the same as subtracting a region), whose
intersection with any of $H_i$, $1 \le i \le k$, is empty. But recall that $B \subset H_1 \cup H_2 \cup \cdots \cup H_k$.
This means that the region the learner is left with does not have any point $x$ s.t.
$f_n(x) > f_n(\hat\eta_1) - 4\epsilon$, and at that stage he stops the algorithm since no point is returned
for the argsup of $f_n$, i.e.,

$$\operatorname{argsup}_{B \cap A_1^c \cap \cdots \cap A_k^c} f_n(x) = \operatorname{argsup}_{\emptyset} f_n(x) = \emptyset.$$

Note that the learner does not need to know the number of modes, $k$, of $f$, since all
he needs to do is keep finding the argsup of $f_n$ over a region which is totally defined
by the first estimate, $\hat\eta_1$, and which becomes smaller and smaller until it becomes
empty exactly when there is no need to estimate any more modes.
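The procedure just described can be sketched on a finite set of evaluation points. This is only an illustration under stated assumptions: the argsup is replaced by a maximum over a finite grid, the radius of each ball $A_i$ is taken as the distance from the current estimate to the nearest grid point whose $f_n$ value is at most $f_n(\hat\eta_1) - 6\epsilon$, and the function name `estimate_modes` and the noise-free test mixture are hypothetical.

```python
import numpy as np

def estimate_modes(points, fn_values, eps):
    """Grid-based sketch of the mode-finding loop.

    points    : array of shape (m,) or (m, N), the evaluation points
    fn_values : array of shape (m,), the density estimate f_n at those points
    eps       : accuracy parameter (uniform bound on |f_n - f|)
    """
    points = np.asarray(points, dtype=float)
    if points.ndim == 1:
        points = points[:, None]
    fn = np.asarray(fn_values, dtype=float)

    top = fn.max()                       # f_n at the first estimate
    in_B = fn > top - 4 * eps            # region B
    low = points[fn <= top - 6 * eps]    # candidate border points z

    modes = []
    while in_B.any():
        candidates = np.flatnonzero(in_B)
        idx = candidates[np.argmax(fn[candidates])]
        mode = points[idx]
        modes.append(mode)
        # Radius of the ball A_i around the current estimate.
        radius = np.linalg.norm(low - mode, axis=1).min() if len(low) else np.inf
        dist = np.linalg.norm(points - mode, axis=1)
        in_B &= dist >= radius           # remove the open ball A_i from B
        in_B[idx] = False                # A_i always contains the estimate itself
    return np.array(modes)

# Illustrative one-dimensional mixture with class centers at -2 and +2.
x = np.linspace(-6.0, 6.0, 1201)
f = np.exp(-(x + 2) ** 2 / 2) + np.exp(-(x - 2) ** 2 / 2)
found = estimate_modes(x, f, eps=0.01)
print(found.ravel())
```

As in the text, the loop does not need the number of modes in advance: it stops when the remaining part of $B$ is empty.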

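Finally we show that the above estimates are consistent as $\epsilon \to 0$. From above,
both $\hat\eta_i, \eta_i \in A_{\epsilon,i}$. Considering the terms inside the definition of $A_{\epsilon,i}$ we have

$$\hat\eta_1 \to \eta_1 \quad \text{as } \epsilon \to 0$$

since

$$\hat\eta_1 = \operatorname{argsup}_{x \in X} f_n(x) \to \operatorname{argsup}_{x \in X} f(x) = \eta_1$$

since $|f_n(x) - f(x)| \to 0$ for any $x \in X$. Also,

$$f_n(\hat\eta_1) \to f(\hat\eta_1) = f(\operatorname{argsup}_{x \in X} f_n(x)) \to f(\operatorname{argsup}_{x \in X} f(x)) = f(\eta_1)$$

and $f_n(z) \to f(z)$. Hence

$$
\begin{aligned}
A_{\epsilon,i} &= \{y : |y - \hat\eta_i| < \inf_{x} \inf_{z \in r_{\hat\eta_i,x}} |z - \hat\eta_i|,\ f_n(z) \le f_n(\hat\eta_1) - 6\epsilon\} \cup \{\hat\eta_i\} \\
&\to \{y : |y - \eta_i| < \inf_{x} \inf_{z \in r_{\eta_i,x}} |z - \eta_i|,\ f(z) \le f(\eta_1)\} \cup \{\eta_i\} \\
&= \{y : |y - \eta_i| < 0\} \cup \{\eta_i\} = \eta_i.
\end{aligned}
$$

As both $\hat\eta_i$ and $\eta_i$ are in $A_{\epsilon,i}$, the above implies that $\hat\eta_i \to \eta_i$ as $\epsilon \to 0$, for all $1 \le i \le k$.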

5.5.2  The  resulting  Perror 

In the previous subsection, it was only necessary to know $M - L$ in order to specify
the allowed range for the accuracy parameter $\epsilon$, without needing to know the number
of modes of $f$. Thus far the algorithm yields consistent estimates for the modes of $f$,
where $f \in \mathcal{F}$.

Using the mode estimates we can form a hyperplane estimate of the optimal
Bayes hyperplane as follows: given the $k$ mode estimates $\hat\eta_i$, $1 \le i \le k$, we minimize
the function

wi\  J 

w.r.t. the $(N - 1)$ unknown $(N + 1)$-dimensional unit vectors $w_1, \ldots, w_{N-1}$ under
the constraints

$$\langle w_j, w_k \rangle = 0, \quad \text{for } 1 \le k \le N - 1 \text{ and } k \neq j.$$

(Nonlinear programming is one approach to solving this.) This will find a line $l$ in $\mathbb{R}^N$
which is least-squares-close to the $k$ mode estimates. We used here the fact that a line
in $\mathbb{R}^N$ can be represented as an intersection of affine hyperplanes in $\mathbb{R}^N$, i.e., as the
set of all points that are orthogonal to a specific set of vectors $w_j$, $1 \le j \le N - 1$.
The term inside the double summation represents the distance of the $i$th point to the
$j$th hyperplane. Thus $e$ represents the total squared distance of the $k$ points from the
line in $\mathbb{R}^N$, and minimizing $e$ yields the least-squares line $l$.

We then form the average

$$\bar\eta = \frac{1}{k} \sum_{i=1}^{k} \hat\eta_i$$

and define the hyperplane estimator to be the unique hyperplane which is orthogonal
to the line $l$ and which goes through the point $\bar\eta$ (which is not necessarily on the line
$l$).
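One way to prototype this step is sketched below. Note the substitution: instead of the constrained minimization over the vectors $w_j$ described in the text, the sketch obtains the direction of a least-squares line from the singular value decomposition of the centered mode estimates (the leading right singular vector). The function name `hyperplane_from_mode_estimates` and the sample estimates are hypothetical.

```python
import numpy as np

def hyperplane_from_mode_estimates(mode_estimates):
    """Return (direction, offset) of the hyperplane {x : direction @ x = offset}.

    direction : unit vector along a least-squares line through the estimates
                (its sign is arbitrary)
    offset    : chosen so that the hyperplane passes through the average
                of the mode estimates
    """
    P = np.asarray(mode_estimates, dtype=float)
    center = P.mean(axis=0)
    # Rows of vt are the right singular vectors; the first one is the
    # direction of largest spread of the centered points.
    _, _, vt = np.linalg.svd(P - center)
    direction = vt[0]
    return direction, float(direction @ center)

# Two slightly perturbed mode estimates near (0, 0) and (4, 0).
direction, offset = hyperplane_from_mode_estimates([[0.1, 0.02], [3.9, -0.02]])
print(direction, offset)
```

Because the sign of the singular vector is arbitrary, the pair `(direction, offset)` and its negation describe the same hyperplane.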

As $n \to \infty$, the mode estimates converge to the true modes and the hyperplane
estimate converges to the optimal Bayes hyperplane. So there exists some function
$h(\epsilon)$ such that the classification error of the decision rule based on this hyperplane is

$$P_{error} = P_{Bayes}(1 + h(\epsilon))$$

where $h(\epsilon) \to 0$ as $\epsilon \to 0$ when the regions are labeled optimally. As in the Gaussian
case, we can use the majority rule with $c_1 \log \frac{1}{\delta}$ labeled examples, where $c_1 > 0$ is
some constant, to guarantee with confidence $> 1 - \delta$ that this is true.

Now, the function $h(\epsilon)$ is the accuracy parameter of the probability of error
of the classifier. The function $h$ depends on the type of $g$'s that are permitted in
the definition of the problem family $\mathcal{V}$, since it is directly related to the amount of
deviation possible in the mode estimates $\hat\eta_i$ when the kernel estimate $f_n(x)$ deviates
by no more than $\epsilon$ from $f(x)$ uniformly over $x \in \mathbb{R}^N$. The flatter the main humps of
$f$, the more such deviation is possible and the more $P_{error}$ can deviate from $P_{Bayes}$.
Therefore, in order to be able to claim an accuracy $h(\epsilon)$ uniformly for all problems
in $\mathcal{V}$, given the sample complexities of Theorem 5.1, we need to ensure that we define
$\mathcal{V}$ with dependence on $h$. One way to create such a $\mathcal{V}$ is to consider a union of families of
classification problems, the $i$th family $\mathcal{V}_i$ being composed of density functions

/»*(*) /„*(*)=« «!.«.€  rw 

and such that for the class $\mathcal{V}_i$ the misclassification error accuracy is $h_i(\epsilon)$. (It is not
difficult to approximate $h_i(\epsilon)$ since it suffices to consider one type of function $g_i$.)
Then define

$$\mathcal{V}_h = \bigcup_{i=1}^{l} \mathcal{V}_i$$

where $l < \infty$, and $h$ is an envelope function for all the $h_i$, $i = 1, \ldots, l$, i.e.

$$h(x) = \sup_{1 \le i \le l} h_i(x).$$

Then any classification problem in the family $\mathcal{V}_h$ can be learned to an accuracy $h(\epsilon)$
using algorithm K with the sample complexities of Theorem 5.1.

5.6  Neural  Network  Clustering 

Here we describe simulation results of a neural network (Kohonen [24]) based on
the Kohonen self-organizing maps, which can learn using unlabeled examples. Our
results are qualitative, giving the intuition for comparing two extreme scenarios: an
only-labeled sample with side information about the class densities, versus a mixed
sample (few labeled examples) without side information; the latter is implemented
via a neural network.

The Kohonen neural network is a popular algorithm that has found numerous
applications in a wide range of fields, e.g., statistical pattern recognition, robot
control, adaptive communication schemes, and speech recognition. It is biologically
inspired by the cortical maps in the brain, which are topologically ordered and organized
with high dependency on their input features. Its algorithm is very similar to the
$k$-means algorithm, which is an ad hoc procedure used to partition multivariate data
into cells that resemble the clustering of the underlying distribution. We first describe
the algorithm and then show how it can be used for learning classification.
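To make the flavor of such a self-organizing map concrete, here is a minimal sketch of a single adaptation step for neurons arranged on a two-dimensional array: the neuron whose weight vector is closest to the input is found, and it and its array neighbors are moved toward the input. The function name `som_step`, the square (Chebyshev) neighborhood, and the fixed values of the gain and neighborhood radius are assumptions made for illustration; schedules for shrinking them over time are omitted.

```python
import numpy as np

def som_step(weights, x, alpha, radius):
    """One adaptation step of a self-organizing map.

    weights : array of shape (rows, cols, dim), one weight vector per neuron
    x       : input example of shape (dim,)
    alpha   : adaptation gain in (0, 1]
    radius  : neighborhood half-width, measured in array indices
    Returns the updated weights and the array index of the winner neuron.
    """
    rows, cols, _ = weights.shape
    # Winner: the neuron whose weight vector is closest to x (Euclidean norm).
    dists = np.linalg.norm(weights - x, axis=2)
    c = np.unravel_index(np.argmin(dists), dists.shape)
    # Neighborhood is defined on the array indices, not in the input space.
    r, col = np.indices((rows, cols))
    in_neigh = np.maximum(np.abs(r - c[0]), np.abs(col - c[1])) <= radius
    new_weights = weights.copy()
    new_weights[in_neigh] += alpha * (x - weights[in_neigh])
    return new_weights, (int(c[0]), int(c[1]))

# Tiny usage example on a 3 x 3 array of two-dimensional weight vectors.
w0 = np.zeros((3, 3, 2))
w0[2, 2] = [1.0, 1.0]
w1, winner = som_step(w0, np.array([2.0, 2.0]), alpha=0.5, radius=1)
print(winner)
print(w1[2, 2], w1[1, 1], w1[0, 0])
```

Neurons outside the neighborhood of the winner keep their weight vectors unchanged, which is what lets nearby array positions come to represent nearby regions of the input space.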

The neural network consists of k neurons with real weight vectors $w_i \in X$, $i = 1, \ldots, k$, where X is the space over which examples x are drawn according to some distribution. The neurons are arranged in a two-dimensional array which defines their spatial neighboring. It is then possible to define the influence of a neuron on the adaptation of other neurons in its vicinity. The notion of vicinity is not in X space but is measured by the array index according to which the neurons are ordered. The weight vectors are adapted by the following iterative procedure

$$w_i(t+1) = \begin{cases} w_i(t) + \alpha(t)\,(x(t) - w_i(t)) & \text{if } i \in N_c(t), \\ w_i(t) & \text{if } i \notin N_c(t), \end{cases} \qquad (5.21)$$

where t represents discrete time, $\alpha(t)$ is an adaptation gain, and $N_c(t)$ is an index set of the neurons around the winner neuron whose index is c. We define the winner as the one neuron whose weight vector $w_c$ is the closest to x w.r.t. the Euclidean norm, i.e., $|x - w_c| = \min_{1 \le i \le k} |x - w_i|$. One can view the quantity $|x - w_i|$ as a real-valued output of the ith neuron. Hence in effect this algorithm is a model of a collection of neurons all seeing the same input vector x and adapting their sensitivity to x (i.e., the weight vectors) according to both the input x and the activity of other neurons. As time evolves, the activity of only the nearer neighbors influences the adaptation of a neuron's weight vector. The parameters $\alpha(t)$ and $N_c(t)$ start at some initial value and decrease at the rate of $O(1/t)$. This choice is ad hoc; however, with it the vectors $w_i$ get ordered in a way which resembles the natural clustering of the examples, which are drawn according to the unknown underlying distribution. This is the well-known Kohonen self-organization phenomenon; it is based on the intuition that the density of weight vectors $w_i$ in X space tends to imitate the probability density of the examples x. In this regard it is similar to some nonparametric density estimation techniques. It is possible to use this neural network for learning a two-class classification problem, as we shall see below.

We define a nearest neighbor partition of the weight vectors w with the ith vector $w_i$ corresponding uniquely to a Voronoi cell

$$V_i = \{x : |x - w_i| < |x - w_j| \text{ for all } j \ne i\}. \qquad (5.22)$$

Clearly, if this partition is labeled, i.e., each cell gets a label 1 or 2, then we have a decision rule: given an x, classify it by the label of the cell in which it falls. Therefore the following learning procedure emerges: pick k weight vectors at random, then show n unlabeled examples x while adapting the w vectors according to the Kohonen rule. Define a nearest neighbor partition using the w vectors, then show m labeled examples and use the majority rule per cell to label each cell; the resulting labeled partition is the classification decision rule. This is the basis for our neural network classification learning experiments. We now describe our results.
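
The procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the simulations: the function names, the gain schedule constants `alpha0` and `radius0`, the initialization from random examples, and the coin flip for cells that receive no labeled example are all assumptions made here.

```python
import numpy as np

def train_kohonen(X, grid_shape, rng, alpha0=0.5, radius0=1.0):
    """Adapt rows*cols weight vectors on the unlabeled rows of X with rule (5.21)."""
    rows, cols = grid_shape
    # Array index of each neuron; vicinity is measured here, not in X space.
    index = np.array([(r, c) for r in range(rows) for c in range(cols)])
    # Initial weight vectors: random examples (an assumption of this sketch).
    W = X[rng.choice(len(X), size=rows * cols, replace=False)].astype(float)
    for t, x in enumerate(X):
        alpha = alpha0 / (1.0 + t)      # adaptation gain, decreasing as O(1/t)
        radius = radius0 / (1.0 + t)    # neighborhood size, decreasing as O(1/t)
        c = np.argmin(np.linalg.norm(W - x, axis=1))            # winner neuron
        in_Nc = np.abs(index - index[c]).max(axis=1) <= radius  # index set N_c(t)
        W[in_Nc] += alpha * (x - W[in_Nc])                      # other neurons unchanged
    return W

def voronoi_cell(W, X):
    """Index of the nearest weight vector for each row of X, i.e. its cell in (5.22)."""
    return np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)

def label_cells(W, X_labeled, y_labeled, rng):
    """Label every cell with the majority label (1 or 2) of the labeled examples in it."""
    cells = voronoi_cell(W, X_labeled)
    labels = np.empty(len(W), dtype=int)
    for i in range(len(W)):
        votes = y_labeled[cells == i]
        if len(votes) == 0:
            labels[i] = rng.integers(1, 3)  # empty cell: pick 1 or 2 at random (assumption)
        else:
            labels[i] = 1 if 2 * np.sum(votes == 1) >= len(votes) else 2
    return labels

def classify(W, cell_labels, X):
    """Classify each row of X by the label of the cell in which it falls."""
    return cell_labels[voronoi_cell(W, X)]
```

Only `train_kohonen` sees the unlabeled sample; the labeled sample enters solely through `label_cells`, which mirrors the role the labeled examples play in the text.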

Figure 5.3: A 4-neuron net trained on a 2-dimensional mixture of two Gaussians with different variances. After training only on unlabeled examples, the neuron weight vectors define a Voronoi partition that is then labeled using labeled examples. (Legend: Class #1, Class #2, × Voronoi vectors.)

5.6.1 The value of a labeled example versus $P_{error}$

We investigated the mixed sample sizes n and m, with dimensionality N = 2, for achieving a specified error; as in previous sections, the labeled examples are used only for labeling the decision regions. We simulated a 4-neuron network with weight vectors in $\mathbb{R}^2$. Unlabeled examples $x \in \mathbb{R}^2$ were drawn according to a mixture of two Gaussians, each with a different covariance matrix (both were diagonal matrices). Figure 5.3 shows the actual data drawn from this mixture; the lines represent the Voronoi cell borders of the partition. We ran six experiments, each differing in the number of randomly drawn unlabeled examples, ranging from n = 20 up to n = 10,000. In each experiment we then showed m randomly drawn labeled examples, where m ranged from the minimum number necessary to leave no cell unlabeled up to m = 100. We measured the classification error as a function of n and m; the learning curves are shown below in Figure 5.4. The curves of the neural network corresponding to the lower unlabeled sample sizes do not reach the low error rate as m increases because we did not utilize labeled examples to further adjust the decision border as in the different variants of the LVQ algorithm of Kohonen [24]. For each experiment we averaged the learning curves of 50 different neural networks. As a reference we then conducted an experiment using the knowledge of the parametric form of the class densities to estimate the sufficient statistics, i.e., the means and covariances for each of the Gaussians, by using only labeled examples, with which the Bayes optimal decision border was estimated. With respect to the neural net, this experiment is different since significantly more information (parametric/identifiability knowledge of the class conditional distributions) is provided to the algorithm.

Examining the intersection of the (dotted) curve of the purely labeled experiment with the (solid) curves of the neural network gives approximately the number of unlabeled examples necessary for the labeled sample size of the neural network to differ by one example from the labeled sample size of the parametric algorithm. This intuitively represents an upper bound on the value of one labeled example in terms of unlabeled examples, because it says that for a fixed error, with no side information and with minimal usage of labeled examples, we need this many unlabeled examples and one fewer labeled example than the case which has maximum side information and uses labeled examples efficiently. The points where the parametric algorithm curve intersects the neural net curves are plotted in Figure 5.5. There we see that the value of a labeled example increases sharply as the objective $P_{error}$ is reduced.

5.6.2  m  versus  the  dimensionality  N 

In this section we describe the effect on the labeled sample size m when increasing the dimensionality N. A partition can be labeled by any one of the $2^k$ labelings, where k is the number of cells. Theoretically, we expect that as the dimensionality increases, k needs to increase in order to construct a partition which achieves a given fixed error criterion (under the optimal labeling). This in turn requires more labeled examples in order to pick the optimal labeling with a fixed confidence. This effect of N on m is clearly undesirable, and since many realistic problems have high dimensionality we sought an approach that can reduce this effect.

Figure 5.4: The set of solid curves corresponds to learning with both unlabeled and labeled examples, n = # unlabeled examples (n = 20, 50, 500, 1000, 5000, 10000). The dotted curve represents learning with only labeled examples (supervised parameter estimation). Axis label: # labeled examples.

Figure 5.5:

As unlabeled examples are taken to be abundant, we did not attempt to limit their supply when choosing the algorithm. Our main focus was to limit the labeled sample size. The Kohonen neural network, in principle, fits this criterion as it can utilize primarily unlabeled examples for learning the decision regions. We therefore considered a variant of its architecture.

The limitation of the Voronoi partition produced by the Kohonen network arises from the piecewise-linearity of the cells (from (5.22), a Voronoi cell has hyperplane borders with its surrounding cells). When the classification problems consist of pattern classes that are not linearly separable, it takes many Voronoi cells to establish a reasonable decision border. This raises the labeled sample size required to optimally label the partition. However, if the cells have nonlinear borders then in many problem situations one can do with fewer cells and hence considerably reduce the labeled sample size.

The question therefore was whether a neural network utilizing primarily unlabeled examples in self-organizing unsupervised learning can produce nonlinear cells and adapt them to achieve a good partition, in particular, a partition that does not have too many cells and which needs fewer labeled examples to be optimally labeled.

Our simulation results (described below) show that a two-layer Kohonen network, featuring self-organization in both layers, performed with the above desired characteristics for a variety of problems. On some problems this network had a labeled sample complexity superior (w.r.t. N) to the Voronoi-partition classifier (based on the regular single-layer Kohonen net); for instance, there were problems in which the labeled sample size was a constant w.r.t. N. We now describe the simulations.

The architecture that we considered has two self-organizing layers. The first layer is a Kohonen network, i.e., a collection of neurons each of whose inputs is a vector x which represents the pattern-class feature vector. The ith neuron is associated with a weight vector $w_i$. The output of the ith neuron is a real scalar $g_i$ which measures the Euclidean distance from x, i.e., $g_i = |x - w_i|$. These neurons adapt their weight vectors according to the Kohonen adaptation rule of (5.21).

The second layer consists of neurons each having a weight vector $y_i$. Their input is the vector g of outputs of the first layer neurons. The neurons of this layer also adapt their $y_i$ according to the Kohonen rule.

Using unlabeled examples, we first train the first layer, producing the adapted $w_i$ vectors. Then using the same examples we train the second layer neurons. This results in a partition of the transformed feature space G, i.e., X is transformed to G by the mapping $g = [|x - w_1|, |x - w_2|, \ldots, |x - w_k|]$. Each of the second layer neurons is associated with a Voronoi cell, i.e.,

$$V_i = \{g : |g - y_i| < |g - y_j|, \; j \ne i\}.$$

Using labeled examples, we label the partition, assigning each cell the label of the majority of the examples that fell in it (or drawing its label with probability 1/2 if none fell).

Given a test vector x, the network classifies it with the label of the cell in which it falls. This establishes the classifier which we denote as the 2-layer network.
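
As a rough illustration of the 2-layer construction, the following Python sketch chains two self-organizing stages and labels the second-layer cells by majority. It is a simplification under stated assumptions: the neighborhood is shrunk to the winner alone (so the array topology is ignored), the gain constant `alpha0` is arbitrary, and all function names are hypothetical.

```python
import numpy as np

def kohonen(X, k, rng, alpha0=0.5):
    """Winner-only Kohonen adaptation (a simplification: N_c(t) = {c} throughout)."""
    W = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for t, x in enumerate(X):
        c = np.argmin(np.linalg.norm(W - x, axis=1))   # winner neuron
        W[c] += alpha0 / (1.0 + t) * (x - W[c])        # gain decreasing as O(1/t)
    return W

def first_layer_output(W, X):
    """Map each row x of X to g = [|x - w_1|, ..., |x - w_k|]."""
    return np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)

def nearest(Y, G):
    """Index of the second-layer Voronoi cell containing each row of G."""
    return np.argmin(np.linalg.norm(G[:, None, :] - Y[None, :, :], axis=2), axis=1)

def train_two_layer(X_unl, X_lab, y_lab, k1, k2, rng):
    W = kohonen(X_unl, k1, rng)                         # first layer, in X space
    Y = kohonen(first_layer_output(W, X_unl), k2, rng)  # second layer, in G space
    cells = nearest(Y, first_layer_output(W, X_lab))
    labels = np.empty(k2, dtype=int)
    for i in range(k2):
        votes = y_lab[cells == i]
        if len(votes) == 0:
            labels[i] = rng.integers(1, 3)   # no example fell: label 1 or 2 with probability 1/2
        else:
            labels[i] = 1 if 2 * np.sum(votes == 1) >= len(votes) else 2
    return W, Y, labels

def classify(W, Y, labels, X):
    """Label of the second-layer cell in which each transformed test vector falls."""
    return labels[nearest(Y, first_layer_output(W, X))]
```

Because the second layer partitions the space of distances g rather than X itself, its cell borders are hyperplanes in G but in general curved surfaces in X.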

We simulated this 2-layer network and compared its m vs. N performance with that of the 1-layer network (i.e., the usual Kohonen network). Some examples of the types of activation regions, i.e., cells, that the 2-layer network exhibited are displayed in Figures 5.6 and 5.7. In these the input space X dimensionality is N = 3. The region between the two mesh surfaces is a cell corresponding to one of the second layer neurons. The first pattern class is represented by the black dots and the second class by the gray dots. The nonlinearity of the surfaces is apparent.

We trained both the 1-layer and the 2-layer networks on a problem consisting of four N-dimensional cubes, mutually contained as $Cube_1 \subset Cube_2 \subset Cube_3 \subset Cube_4$, and where the first class is defined as $Cube_1 \cup Cube_3$ and the second class as $Cube_2 \cup Cube_4$. We first drew unlabeled examples distributed uniformly and then labeled examples distributed uniformly over each class. The case of N = 2 is displayed in Figure 5.8. We measured the sample complexity m w.r.t. N for both networks, which is needed to achieve a constant error rate across the range of N. Figure 5.9 shows the labeled sample complexity versus dimension N which, for the 1-layer network, increases with increasing N. The 2-layer network needed only a constant number of labeled examples.

We then considered random classification problems, picking 30 clusters per class randomly positioned over a two-ring region. Figure 5.10 shows an instance of the problem with N = 2. The vertices of the mesh indicate the position of the first layer neurons and the black and gray clusters are class 1 and 2 respectively.

Figure 5.8:

Figure 5.9: (axes: # labeled examples versus dimension N; curves: 1-layer, 2-layer)

We measured the error versus labeled sample complexity of 20 randomly chosen 1-layer networks and 20 randomly chosen 2-layer networks per dimension N and plotted the results over a range $2 \le N \le 12$. As a measure of comparison between the 1-layer and 2-layer net we used the following ratio

$$\frac{m_{1\text{-layer}} / m_{2\text{-layer}}}{P_{error,2} / P_{error,1}} = \frac{R_m}{R_p}$$

which measures the number, $R_m$, of 1-layer examples needed for every one 2-layer example in order to achieve a ratio of $R_p$ 2-layer misclassifications per one 1-layer misclassification. The higher it is, the worse the 1-layer performance w.r.t. the 2-layer performance. As seen in Figure 5.11, this ratio increases as N increases. This suggests that on average, the 2-layer network requires fewer labeled examples than the 1-layer network, for the random-clusters-on-rings problems, and the saving in labeled examples increases with the input dimensionality.
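
To make the comparison ratio concrete, here is a small Python sketch of the arithmetic. The function name and the input numbers are hypothetical and chosen only for illustration; they are not measurements from the experiments.

```python
def layer_ratio(m_1layer, m_2layer, p_error_1layer, p_error_2layer):
    """Return (R_m, R_p, R_m / R_p) for the comparison ratio defined above."""
    r_m = m_1layer / m_2layer              # 1-layer examples per one 2-layer example
    r_p = p_error_2layer / p_error_1layer  # 2-layer errors per one 1-layer error
    return r_m, r_p, r_m / r_p

# Made-up inputs: 120 vs. 40 labeled examples, error rates 0.10 vs. 0.05.
r_m, r_p, ratio = layer_ratio(120, 40, 0.10, 0.05)
```

With these made-up inputs $R_m = 3$ and $R_p = 0.5$, so the ratio is 6: the 1-layer network uses three times the labeled examples and still makes twice the errors.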

The ring problems are particularly difficult for a linear partition since class 1 encloses class 2, and the difficulty becomes worse with multiple rings. There are a host of easier problems, e.g., a cluster in the shape of a “C” next to another cluster, but not containing it. Here the 1-layer network performs better (i.e., fewer labeled examples are needed for the same error) than in the multiple ring case, but the 2-layer net still does better than the 1-layer.

Figure 5.11:

We also ran both types of networks on problems with classes that consist of randomly positioned nonoverlapping clusters, but not on ring contours as before. Again, performance was measured as the labeled sample complexity versus dimensionality N. Here, the 1-layer net performed as well as the 2-layer net. Trying to decrease the number of cells in the second layer resulted in poor performance. This is due to the limitation of the nonlinearity of the cells. The decision borders that can be achieved with partitions based on the 2-layer architecture are not better than those that the usual Voronoi partition achieves, when considering the average performance over these types of randomly generated problems.

The above ideas fit under a common strategy of sample reduction, namely fitting the family of possible classification decision regions to the family of classification problems. The 2-layer network reduced the sample complexity for problems where the class clusters are distributed over N-dimensional spherical contours.

5.7 The Method of k-Means


The Kohonen neural network uses an adaptation rule for the neural weights that is similar to the k-means algorithm (cf. MacQueen [36]). This procedure adapts k vectors, $y_1, \ldots, y_k$, so as to minimize the empirical mean square error (MSE)

$$e_n(Y) = \frac{1}{n} \sum_{i=1}^{n} \min_{1 \le j \le k} |x_i - y_j|^2$$

where $Y = [y_1, \ldots, y_k]$ with $y_j$ being N-dimensional column vectors and the $x_i$ are the N-dimensional example vectors. The matrix Y defines a Voronoi partition with k cells. The true MSE is denoted by

$$e(Y) = E \min_{1 \le j \le k} |x - y_j|^2$$

where expectation is w.r.t. the underlying pattern mixture density $f(x)$. A necessary condition for minima of $e(Y)$ is that the partition must have its k vectors as the centroids of the corresponding cells (cf. Gersho [37]).
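
The empirical MSE and the centroid condition can be illustrated with a short Python sketch. The function names are hypothetical; `lloyd_step` is the standard batch centroid update, used here only to show that replacing each vector by the centroid of its cell cannot increase the empirical MSE. It is not claimed to be the procedure used in the dissertation.

```python
import numpy as np

def empirical_mse(Y, X):
    """e_n(Y): mean over examples of the squared distance to the nearest vector y_j."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def lloyd_step(Y, X):
    """Replace each y_j by the centroid of the examples falling in its Voronoi cell."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    cells = d2.argmin(axis=1)
    Y_new = np.array(Y, dtype=float)
    for j in range(len(Y_new)):
        if np.any(cells == j):          # leave a vector in place if its cell is empty
            Y_new[j] = X[cells == j].mean(axis=0)
    return Y_new
```

For instance, with four examples at 0, 2, 10, and 12 on a line and initial vectors at 0 and 10, one step moves the vectors to the cell centroids 1 and 11 and halves the empirical MSE from 2 to 1.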

Using the theory of uniform SLLN we can find the sufficient number of examples $x_i$ such that for any partition Y, $|e(Y) - e_n(Y)| < \epsilon$ (cf. Pollard [21]). It follows that an algorithm which finds a partition $\hat{Y}$ that minimizes $e_n(Y)$ in effect finds a partition which minimizes $e(Y)$ to within $\epsilon$ accuracy. This idea can be used for classification problems. Suppose one of the r minima of $e(Y)$, denoted by $Y^*$, is a partition for which $P_{error} = P_{Bayes}$. We can then use unlabeled examples together with an algorithm that locates the r local minima of $e_n(Y)$ to estimate the local minima of $e(Y)$, and use labeled examples to choose the best performing partition out of the r possible ones. This would require an algorithm which uses unlabeled examples to discover consistent estimates of the r global minima of $e(Y)$ and hence would be similar to the mode estimation of Section 5.5.

We look at the problem in which r = 1, i.e., $e(Y)$ has one global minimum, and hence the labeled examples are not needed for choosing one of several possible partitions, but just to label one partition estimate, $\hat{Y}^*$, of $Y^*$.

We first present the algorithm and then derive its sample complexities.


Algorithm S:

The setting: $C \subset \mathbb{R}^{2N}$ is a compact parameter space, $Y = [y_1, y_2] \in C$, where Y indexes a two-vector Voronoi partition. The MSE is defined as

$$e(Y) = E \min_{1 \le j \le 2} |x - y_j|^2.$$

The two pattern classes are such that there exists a two-vector Voronoi partition, $Y^* \in C$, being the partition of the Bayes classifier, and such that $e(Y^*)$ is the unique global minimum of the function $e(Y)$.

Given: m labeled examples and n unlabeled examples drawn according to an unknown mixture $f(x)$.

Begin:

1) Using the unlabeled examples find the point $\hat{Y}^*$ which minimizes the empirical MSE, i.e.,

$$\hat{Y}^* = \arg\inf_{Y \in C} \frac{1}{n} \sum_{i=1}^{n} \min_{1 \le j \le 2} |x_i - y_j|^2.$$

2) Form the hyperplane perpendicular to the line through $\hat{y}_i^*$, i = 1, 2, and which passes through their midpoint.

3) Label the two decision regions across the hyperplane by the label of the majority of the examples on either side.

End.
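
The three steps of Algorithm S can be sketched in Python as follows. One caveat is essential: step 1 asks for the exact minimizer of the empirical MSE over C, whereas this sketch substitutes a heuristic (batch centroid iterations with a few random restarts), which is an assumption made here and carries no guarantee of reaching the global minimum. The function names, the restart and iteration counts, and the default label for an empty side are likewise hypothetical.

```python
import numpy as np

def two_means(X, rng, restarts=5, iters=50):
    """Heuristic stand-in for step 1: search for [y_1, y_2] with small empirical MSE."""
    best, best_err = None, np.inf
    for _ in range(restarts):
        Y = X[rng.choice(len(X), size=2, replace=False)].astype(float)
        for _ in range(iters):
            d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
            cells = d2.argmin(axis=1)
            for j in range(2):
                if np.any(cells == j):
                    Y[j] = X[cells == j].mean(axis=0)
        err = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2).min(axis=1).mean()
        if err < best_err:
            best, best_err = Y.copy(), err
    return best

def algorithm_s(X_unl, X_lab, y_lab, rng):
    """Return a decision rule mapping rows of X to labels in {1, 2}."""
    y1, y2 = two_means(X_unl, rng)                  # step 1 (heuristic minimizer)
    normal, midpoint = y2 - y1, 0.5 * (y1 + y2)     # step 2: bisecting hyperplane

    def side(X):
        # 0 for the half-space containing y1, 1 for the one containing y2
        return ((X - midpoint) @ normal > 0).astype(int)

    labels = np.empty(2, dtype=int)
    s = side(X_lab)
    for j in range(2):                              # step 3: majority label per side
        votes = y_lab[s == j]
        if len(votes) == 0:
            labels[j] = 1                           # empty side: arbitrary default (assumption)
        else:
            labels[j] = 1 if 2 * np.sum(votes == 1) >= len(votes) else 2
    return lambda X: labels[side(X)]
```

Note that the unlabeled sample alone fixes the hyperplane; the labeled sample only decides which side receives which label, which is why few labeled examples suffice in this setting.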


As an instance of such a problem, consider for simplicity the classification problem which consists of two pattern classes, each in a cluster, sufficiently separated so that the partition $Y^*$, defined by the hyperplane perpendicular to the line through the two cluster centroids, separates the two clusters and achieves the global minimum of $e(Y)$. We now find the sufficient number of unlabeled examples, n, and labeled examples, m, to learn the Bayes rule to within small error and high confidence.

Unlabeled examples are used to compute the empirical MSE $e_n(Y)$. In order for minimization of the empirical MSE to yield a good partition we need

$$P\left( \sup_{[y_1,y_2] \in C} \left| E \min\{|x - y_1|^2, |x - y_2|^2\} - \frac{1}{n} \sum_{i=1}^{n} \min\{|x_i - y_1|^2, |x_i - y_2|^2\} \right| > \epsilon \right) < \delta/2 \qquad (5.23)$$

where $\delta$, $\epsilon$ are two arbitrarily small positive constants, and $y_1, y_2$ are the two vectors defining a partition Y whose two regions are

$$R_1 = \{x : |x - y_1| < |x - y_2|\}, \qquad R_2 = \{x : |x - y_1| > |x - y_2|\}$$

and C is a compact subset of $\mathbb{R}^N \times \mathbb{R}^N$ which contains the optimal $[y_1^*, y_2^*]$-based partition which achieves the minimum MSE.

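In order to use Theorem 3.9 we need to define a class of bounded functions. Let $g_{y_1,y_2}(x) = \min\{|x - y_1|^2, |x - y_2|^2\}$. Then

$$\begin{aligned}
P\left( \sup_{[y_1,y_2] \in C} \left| E g_{y_1,y_2}(x) - \frac{1}{n} \sum_{i=1}^{n} g_{y_1,y_2}(x_i) \right| > \epsilon \right)
\le{}& P\left( \sup_{[y_1,y_2] \in C} \left| \int_{D} g_{y_1,y_2}(x)\,dP - \frac{1}{n} \sum_{i=1}^{n} g_{y_1,y_2}(x_i) \mathbf{1}_{\{x_i \in D\}} \right| > \epsilon/2 \right) \\
&+ P\left( \sup_{[y_1,y_2] \in C} \left| \int_{D^c} g_{y_1,y_2}(x)\,dP - \frac{1}{n} \sum_{i=1}^{n} g_{y_1,y_2}(x_i) \mathbf{1}_{\{x_i \in D^c\}} \right| > \epsilon/2 \right) \qquad (5.24)
\end{aligned}$$

where D is a compact subset of $\mathbb{R}^N$. Define the class

$$\mathcal{H} = \{ g_{y_1,y_2}(x)\,\mathbf{1}_{\{x \in D\}} : [y_1, y_2] \in C \}.$$

The functions in $\mathcal{H}$ are bounded since

$$\min\{|x - y_1|^2, |x - y_2|^2\} \le 2|x|^2 + 2|x|(|y_1| + |y_2|) + |y_1|^2 + |y_2|^2 \le M$$

as $|x|^2$ is bounded over the compact region D and $|y_1|^2$, $|y_2|^2$ are bounded over the compact set C. Theorem 3.9 can be used to bound the first term of (5.24) by $\delta/4$.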
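
The boundedness claim can be verified in a few lines. The following worked step is a sketch; the radii $R_D$ and $R_C$ are notation introduced here for illustration and do not appear in the original.

```latex
% Assumed notation: R_D = \sup_{x \in D} |x|,
%                   R_C = \sup_{[y_1,y_2] \in C} \max(|y_1|, |y_2|).
\begin{align*}
\min\{|x-y_1|^2,\,|x-y_2|^2\}
  &\le |x-y_1|^2 + |x-y_2|^2 \\
  &= 2|x|^2 - 2\,x\cdot(y_1+y_2) + |y_1|^2 + |y_2|^2 \\
  &\le 2|x|^2 + 2|x|\,(|y_1|+|y_2|) + |y_1|^2 + |y_2|^2 \\
  &\le 2R_D^2 + 4R_D R_C + 2R_C^2 \;=\; 2(R_D+R_C)^2 \;=:\; M .
\end{align*}
```

The third line uses the Cauchy-Schwarz and triangle inequalities; both radii are finite because D and C are compact, so M is one admissible choice of the bound.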

The second term is bounded by

$$P\left( \sup_{[y_1,y_2] \in C} \left| \int_{D^c} g_{y_1,y_2}(x)\,dP \right| + \sup_{[y_1,y_2] \in C} \left| \frac{1}{n} \sum_{i=1}^{n} g_{y_1,y_2}(x_i) \mathbf{1}_{\{x_i \in D^c\}} \right| > \epsilon/2 \right)$$

and

$$\sup_{[y_1,y_2] \in C} \int_{D^c} g_{y_1,y_2}(x)\,dP \le \int_{D^c} \sup_{[y_1,y_2] \in C} g_{y_1,y_2}(x)\,dP \le \int_{D^c} |x|^2\,dP + M_1 \int_{D^c} |x|\,dP + M_2 \int_{D^c} dP$$

where $M_1, M_2 < \infty$ and we used the fact that $|y_1|^2$, $|y_2|^2$ are bounded over C. We assume that the class mixture $f(x)$ is such that there exists a compact D that makes these last integrals arbitrarily small; in particular, for the case in which the class clusters are bounded we can let D be a ball which contains both clusters, which makes these integrals equal zero since the probability measure (corresponding to the mixture distribution) is zero outside the ball. Hence we can bound the right side by an arbitrarily small quantity $\lambda > 0$. As on page 67, the second term of (5.24) is then bounded by $\delta/4$, where $\delta$ is a given arbitrary confidence parameter.

Proceeding to find a bound on the first term of (5.24) by using Theorem 3.9, we only need to calculate the VC dimension of the class $\mathcal{H}$ of functions. First we find the VC dimension of the class $\mathcal{A}$ of functions $\{|x - a|^2 : a \in \mathbb{R}^N\}$. A function here can be expressed as a linear combination

$$|x - a|^2 = \sum_{i=1}^{N} x_i^2 - 2 \sum_{i=1}^{N} a_i x_i + |a|^2 = \sum_{i=1}^{2N+1} \alpha_i \phi_i(x)$$

where the $\alpha_i$ are constants and the $\phi_i(x)$ are basis functions. The class of graphs of these functions has VC = 2N + 1 by Theorem 3.6, and by Definition 3.5 it follows that $VC(\mathcal{A}) = 2N + 1$. The graphs of functions of $\mathcal{H}$ are intersections of graphs of functions of $\mathcal{A}$ intersected with a fixed set $\{(x,z) : 0 \le z \le \mathbf{1}_{\{x \in D\}}\}$, i.e.,

$$\{(x,z) : 0 \le z \le |x - y_1|^2\} \cap \{(x,z) : 0 \le z \le |x - y_2|^2\} \cap \{(x,z) : 0 \le z \le \mathbf{1}_{\{x \in D\}}\}.$$

Given an m-sample, the number of dichotomies achieved by such a class of graphs is at most $m^{VC(\mathcal{A})} m^{VC(\mathcal{A})}$. Finding the largest m such that it equals $2^m$ yields

$$VC(\mathcal{H}) \le 2(4N + 2) \log(4N + 2).$$


Plugging  this  into  Theorem  3.9  we  have  the  sufficient  number  of  unlabeled  examples 
n  in  order  for  (5.23)  to  be  satisfied  is 

n  >  ((16AT  +  8)  log(4jV  +  2)  log  +  log  y)  . 

An $\epsilon$-accurate estimate of $e(Y)$ by $e_n(Y)$ implies that we can estimate $Y^* = [y_1^*, y_2^*]$ by $\hat{Y}^* = [\hat{y}_1^*, \hat{y}_2^*]$ such that $|\hat{y}_i^* - y_i^*| < \sigma$, i = 1, 2, where $\sigma > 0$ is arbitrarily small, depending on $\epsilon$, $Y^*$, and $\hat{Y}^*$. Developing on this theme, we have

$$e(\hat{Y}^*) - \epsilon \le e_n(\hat{Y}^*) \le e_n(Y^*) \le e(Y^*) + \epsilon$$

where the middle inequality follows from $\hat{Y}^* = \arg\min_{Y \in C} e_n(Y)$. So

$$|e(\hat{Y}^*) - e(Y^*)| \le 2\epsilon.$$

Now assume that $e(\cdot)$ is continuous and 1-1 at least in some ball around $Y^*$ (for conditions cf. Apostol [28] p. 370). Then (cf. Rudin [27] p. 90) its inverse is continuous there and therefore

$$|e(\hat{Y}^*) - e(Y^*)| \le 2\epsilon \implies |\hat{Y}^* - Y^*| < \sigma$$

for small enough $\epsilon > 0$, where $\sigma$ is arbitrarily small and depends on $\epsilon$, $Y^*$, $\hat{Y}^*$. Hence the learner draws n (as above) unlabeled examples, then locates the argmin of $e_n(Y)$. This vector, $\hat{Y}^*$, is $\sigma$-close to the vector $Y^*$ at which the minimum of the true MSE, $e(\cdot)$, occurs. That implies $|\hat{y}_i^* - y_i^*| < \sigma$, i = 1, 2. Recall that by assumption, the partition based on the hyperplane which is associated with this $Y^*$ achieves the minimum $P_{error}$. From Section 4.1 we have $P_{error} = P_{Bayes}(1 + O(\sigma^2(\epsilon)))$ when labeled optimally. From Section 4.3 we need $m = A \log\frac{1}{\delta}$, for some constant A, labeled examples to guarantee that the probability of not labeling optimally is at most $\delta/2$.

Combining all of the above it follows that with confidence $> 1 - \delta$, and
n  >  ((16iV  +  8)  log(4iV  +  2)  log  +  log  y) 

unlabeled examples and

$$m = A \log\frac{1}{\delta}$$

labeled examples, algorithm S, which minimizes the empirical MSE, finds a classification rule which has

$$P_{error} = P_{Bayes}(1 + c\,\sigma^2(\epsilon))$$

for some positive constant c and where $\sigma(\cdot)$ depends on the problem by depending on the MSE function $e(Y)$.

Chapter 6
Conclusions 


Based on the finite sample complexity results we now discuss their implications on the worth of a labeled example under several different scenarios. All constants $c_i$, $i = 0, 1, 2, \ldots$ are finite and positive.

First we compile the results concerning the learning of a classification problem with an underlying Gaussian mixture. In the following discussion, f will denote the unknown underlying Gaussian mixture of two equiprobable pattern classes, unless stated otherwise.

We showed in Section 4.2 that with knowledge of both the parametric form of f and that f is a member of an identifiable family, algorithm M suffices with

N2  (  1  1\ 

n„  =  c1^^|og-  +  1°gjj 


unlabeled  examples 


C2  log  ^ 


labeled examples. We compare this to the purely-labeled sample scenario of Section 4.1 where a learner using algorithm E required $m_E$ labeled examples to learn the same problem. Clearly there is a reduction by introducing $n_M$ unlabeled examples. Dividing $n_M$ by the difference $m_E - m_M$ yields a rough estimate of the worth of a labeled example, namely


c3N2 


(6.1) 


e38  log  N 

unlabeled  examples.  This  is  polynomial  in  N,  and  i. 

In Section 5.2 we showed that when learning the same classification problem but not knowing the parametric form of f nor the fact that the class of Gaussian mixtures is identifiable, algorithm K required

^gJVlog(5+log  N)  ^ 

TIk  —  c4  n  log 


E 

C^iogN 
unlabeled examples and


mK 


cslogj 


labeled examples. As before, comparing this to $m_E$ we have that a labeled example is worth roughly


c<3 1 los(5+los  N) 
N  log  ./Ve2l°s  x 


(6-2) 


unlabeled examples.

This is roughly exponential in N and $1/\epsilon$ and is therefore considerably more than the previous polynomial expression. As discussed in Chapter 5, the same $n_K$ and $m_K$ apply also for algorithm K when learning a function with a more general form than the Gaussian mixture f. So the reason that it takes significantly more unlabeled examples to learn f with algorithm K than with algorithm M is the larger complexity of the family of functions of which f is a member under algorithm K.

Therefore when learning the same Gaussian mixture f under the nonparametric scenario, a labeled example is worth exponentially more unlabeled examples than in the parametric scenario.

We can equivalently express the above results by showing $P_{error}(m,n)$ as a function of m and n, i.e., $P_{error}(m,n) = g(m) + h(n)$. For the mixed sample learning, in both the parametric and the nonparametric scenarios the function $g(m)$ is $O(e^{-b_0 m})$, $m \to \infty$, for some constant $b_0 > 0$ independent of N. For the parametric case the
function  h(n)  is  O  n  ^  oo,  and  in  the  nonparametric  case  the  function 

h(n)  is  0  n  —¥  oo,  where  bi,b2  >  0  are  independent  of  N.  Cover  Sz 

Castelli [5] report a similar expression for the $P_{error}$, namely a polynomial decrease in n and exponential decrease in m; however, our results specifically point out the dependency on the dimensionality N. For fixed N, it is clear that in both scenarios the number of unlabeled examples is exponentially more than the number of labeled examples. With variable N we conclude that the value of a labeled example is exponentially more in the nonparametric scenario. In the purely labeled sample
case,  Perror{m)  is  0  m  — *  °°>  f°r  &4  >  0  is  constant  with  N.  Hence  the 

rate  of  decrease  of  the  error  w.r.t.  m  is  exponential  when  unlabeled  examples  are 
introduced  to  the  learning.  Without  unlabeled  examples,  it  is  only  polynomially  fast. 

In Sections 4.4 and 4.5 we considered the same Gaussian mixture problem, but
with general a priori class probabilities, $p$ and $1 - p$. The algorithms $E_p$,
$M_1$, and $M_2$ assume that $f$ is a member of a parametric family of
identifiable mixtures. The parameter vector indexing an $f$ in this family is
$[\theta, p]$, where $p$ is the class "1" a priori probability and $\theta$ is
the two-means vector. W.l.o.g. we assume $p < 1 - p$. We now determine the
tradeoff between the number of labeled and unlabeled examples, for general
$0 < p < 1/2$, in the following three scenarios: (1) both $\theta$ and $p$ are
learned using a purely labeled sample; (2) $\theta$ is learned using an unlabeled
sample and $p$ is learned using a labeled sample; (3) both $\theta$ and $p$ are
learned using an unlabeled sample. (The last two are the mixed-sample cases.)

The sample complexities sufficient to achieve
$P_{error} \le P_{Bayes}(1 + c_7\epsilon)$ with confidence $\ge 1 - \delta$ are
as follows. In (1), for constant $p$, the sufficient labeled sample size is

$$m_{E_p} = c_8\,\frac{N}{\cdots}\,\log\frac{N}{\cdots}.$$

When $p$ varies it affects $m$ through a factor polynomial in $1/p$; however, as
$p \to 0$, $m \to 1$ while $P_{error} \to 0$.

In (2), for fixed $p$, the labeled sample complexity decreases to

$$m_{M_1} = c_9\,\frac{1}{\epsilon^2}\,\log\frac{1}{\epsilon},$$

which is independent of $N$, on account of introducing

$$n_{M_1} = c_{10}\,\frac{N^3 \log\frac{1}{\cdots}}{\cdots}$$

unlabeled examples. The parameter $p$ affects $n$ through a factor of
$\log^2(1/p)$ and $m$ through a factor polynomial in $1/p$, but $n \to 0$ and
$m \to 1$ as $p \to 0$.

In case (3), the labeled sample size $m_{M_2}$ is practically a constant, while
for a fixed $p$ the unlabeled sample size is

$$n_{M_2} = c_{11}\,\frac{N^3 \log\frac{1}{\cdots}}{\cdots}.$$

For variable $p$, $n$ depends on $p$ through a factor of $1/p^{14}$; however, as
$p \to 0$, $n \to 0$ and $m \to 1$.

From the above it follows that $n$ and $m$ are affected by $p$ in the worst case
through a factor polynomial in $1/p$. When $p$ is estimated using the unlabeled
sample, $n$ grows w.r.t. $p$ faster than the rate at which $m$ grows w.r.t. $p$
under the scenario where the labeled sample is used to estimate $p$. So there is
a tradeoff between the cost and the number of examples: unlabeled examples are
cheaper but more of them are needed. The same situation also applies to the
estimation of the means: when unlabeled examples are used, more of them are
needed than when labeled examples are used.

For $p \to 0$, the labeled sample size goes to 1 and the unlabeled sample size
goes to 0. This is anticipated, since small $p$ implies that one of the pattern
classes has a negligible effect on the decision rule and hence can be ignored,
letting all of $\mathbb{R}^N$ have the complement label while suffering a small
misclassification error. The algorithms that are used all need at least one
labeled example.

Observing the number of unlabeled examples needed to reduce the labeled sample
size below $m_{E_p}$, we have as a rough estimate that one labeled example is
worth

$$c_{12}\,\frac{N^2 \log\frac{1}{\cdots}}{\epsilon^4\, p^{\cdots} \log N}$$

unlabeled  examples,  which  is  a  polynomial  in  K 

We also partially analyzed algorithm $M$, which is based on the MLE technique,
for the two classes having the same non-unit covariance matrix. We encountered
some difficulty in the part of the proof (of Theorem 4.2) where it is necessary
to show the independence of the function $\Phi(\theta)$ from $N$ for all $N$
greater than some constant. In all other parts there was no difficulty, although
the work is more involved than for the unit covariance case. In particular, we
let the parameter $\theta = [\mu_1, \mu_2, \Sigma^{-1}]$, and under the condition
that the unknown mixture $f(x|\theta_0)$ has a nonsingular covariance matrix
$\Sigma_0$, we can define a compact set containing $\theta_0$ and not containing
any points with singular covariance matrices. This enables us to define a bona
fide class of bounded functions, indexed by the parameter $\theta$, and thereby
to apply the uniform SLLN theorems. We conjecture that for the part of the proof
which was not yet completed, there exists a way to find the necessary symmetry in
the integrals that define $\Phi(\theta)$ such that for $N \ge N_0$,
$\sup_{\theta \in B(\cdots)} \Phi(\theta)$ is constant, where
$3 \le N_0 < \infty$. Based on this, the unlabeled sample complexity will still
remain polynomial in $N$,

l 

«• 

In Section 5.7 we considered the k-means method, a nonparametric approach to
learning classification which is based on ad hoc clustering. As before, the
classifier here consists of a partition and a labeling of each of the decision
regions. The partition is composed of Voronoi cells, each associated with a
vector $y_i$, hence the partition can be indexed by a matrix $Y$ of $k$ vectors
$y_i \in \mathbb{R}^N$, $1 \le i \le k$. The classical vector-quantization
techniques (cf. Gersho [37]) use such a partition as a mapping from the input
space containing the vector $x$ to the output space, which is the finite set of
vectors $y_i$, $1 \le i \le k$. A common measure of discrepancy between the
random input $x$ and the output $y$ is the mean squared error (MSE). Thus a good
partition, for vector quantization, is one which minimizes the MSE. A classifier
can be constructed from the partition provided that each Voronoi cell is
assigned a label of either class. There are several labelings possible and the
learner is to pick the one which minimizes the probability $P_{error}$ of
misclassification. In general, a partition which has a minimum MSE does not
necessarily have a minimum $P_{error}$ under optimal labeling, but for some
problems the minimum-MSE partition does yield a good classifier.
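
The construction just described (a Voronoi partition fitted to unlabeled
examples, followed by a majority-rule labeling of the cells from a few labeled
examples) can be sketched as below. This is only an illustrative sketch, not
algorithm S itself: it assumes plain Lloyd iterations as the MSE-reducing
heuristic, and the function names `nearest`, `kmeans`, `label_cells`, and
`classify` are hypothetical.

```python
import random

def nearest(centers, x):
    """Index of the Voronoi cell (nearest center) that contains x."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])))

def kmeans(unlabeled, k, iters=50, seed=0):
    """Lloyd iterations on the unlabeled sample (assumed heuristic for
    lowering the empirical mean squared error)."""
    rng = random.Random(seed)
    centers = rng.sample(unlabeled, k)
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for x in unlabeled:
            cells[nearest(centers, x)].append(x)
        # Move each center to the mean of its cell; keep it if the cell is empty.
        centers = [
            tuple(sum(c) / len(cell) for c in zip(*cell)) if cell else centers[i]
            for i, cell in enumerate(cells)
        ]
    return centers

def label_cells(centers, labeled, default=1):
    """Majority rule: each cell takes the most frequent label falling in it."""
    votes = [{} for _ in centers]
    for x, lab in labeled:
        v = votes[nearest(centers, x)]
        v[lab] = v.get(lab, 0) + 1
    return [max(v, key=v.get) if v else default for v in votes]

def classify(centers, cell_labels, x):
    """Label of the Voronoi cell that contains x."""
    return cell_labels[nearest(centers, x)]
```

With well-separated classes, a handful of labeled examples is enough to label
the cells, while the partition itself is determined entirely by the unlabeled
sample.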

We considered the problem for which there exists a unique minimum-MSE partition
with $P_{error} = P_{Bayes}$. Using algorithm $S$ we showed that it is sufficient
to have $n_{MS}$ be polynomial in $1/\epsilon$, $\log(1/\delta)$, and $N$ to
achieve a $P_{error}$ which is $c_{13}\,\alpha^2(\epsilon)$, where $\alpha$ is a
function depending on the smoothness of the MSE function
$e(Y) = E \min\{|x - y_1|^2, |x - y_2|^2\}$ over the region
$C \subset \mathbb{R}^{2N}$. Here $\alpha$ is analogous to the function $\beta$
of Section 5.5 in that they both represent the worst misclassification error
deviation when the uniform deviation between the empirical and the true means
over a class of functions is at most $\epsilon$. The labeled sample size
$m_{MS}$ is practically an absolute constant.

When compared to the results of algorithm $K$, it may at first seem surprising
that this nonparametric k-means classification scheme requires only a polynomial
number of unlabeled examples. However, we note that algorithm $S$ is really
parametric, since it searches for an optimal partition, or its associated
parameter $Y^*$, in a Euclidean parameter space. (We still call this a
nonparametric approach to classification since knowledge of the parametric form
of the class densities is not required.) So we should expect its sample
complexities to be of the same order of magnitude as for algorithm $M$, which is
also based on a search for a function in a class of parametric functions.
Consequently we do not anticipate such a parametric approach to perform
satisfactorily on a rich nonparametric family of density mixtures, since the
complexity of such a family of functions mismatches the complexity of the
parametric function class on which algorithm $S$ is based. However, with
heuristics, as for instance in the different variants of the LVQ algorithm (cf.
Kohonen [24]) where the partition is adjusted using labeled examples, it may be
possible to improve the classification error ad hoc.

We now discuss another possible approach to estimating the modes of $f$ by using
the kernel technique. In Chapter 5, algorithm $K$ estimated $f(x)$ using the
kernel technique uniformly for all $x \in \mathbb{R}^N$, i.e., we used
$\sup_{x \in \mathbb{R}^N} |f_n(x) - f(x)|$ as the estimation discrepancy.
However, only the modes of $f$ were needed by the algorithm for constructing the
decision border. This suggests that perhaps it suffices to estimate the
functional values of $f$ at its $k$ modes $\eta_i$, $1 \le i \le k$. The
difficulty is that the modes $\eta_i$ are not known, hence one cannot even
specify what there is to be estimated. However, suppose that the learner does
know that the modes of $f$ are restricted to lie in a ball $B$ of radius $\rho$
centered at some point $x_0$ of $\mathbb{R}^N$. He can then consider an
$\epsilon$-cover, $S$, (w.r.t. the Euclidean norm) of $B$ having a cardinality
which is bounded above by $s_\epsilon = (\cdots/\epsilon)^N$. (For brevity we
also denote it by $s$.) By definition, for any point $x \in B$ there exists a
$y_j \in S$ such that $|y_j - x| \le \epsilon$. For continuous $f$ and small
enough $\epsilon > 0$ we can guarantee that there exist $y_j \in S$, each
$\epsilon$-close to its corresponding $\eta_j$, such that
$|f(y_j) - f(\eta_j)| < \alpha$ for arbitrary $\alpha > 0$ and $1 \le j \le k$.
(Note that the learner knows the points $y_j$, $1 \le j \le s$.)

The learner may then use the kernel technique to generate the $s$ values
$f_n(y_j)$ as $\alpha$-estimates of the values $f(y_j)$, i.e.,

$$f_n(y_j) = \frac{1}{n}\sum_{i=1}^{n} K_{\sigma,y_j}(x_i),$$

where $x_i$, $1 \le i \le n$, are the unlabeled examples, such that
$|f_n(y_j) - f(y_j)| < \alpha$. It follows that for each $\eta_i$,
$1 \le i \le k$, there is a subset $A_i$ of $S$ which contains points $y_j$ such
that $|f_n(y_j) - f(\eta_i)| < 2\alpha$, and from above, there exists such a
point which is $\epsilon$-close to $\eta_i$. Thus using a variant of algorithm
$K$ the learner can obtain $\epsilon$-close estimates of the modes of $f$.
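
The idea of evaluating the kernel estimate only on a finite cover, rather than
uniformly over the whole space, can be sketched in one dimension as follows.
This is a simplified illustration under stated assumptions, not the variant of
algorithm K referred to above: it assumes a Gaussian kernel with a hypothetical
bandwidth `h`, takes the ball to be the interval `[x0 - rho, x0 + rho]`, and
selects the `k` largest local maxima of the estimate over the cover as the mode
estimates. All function names are hypothetical.

```python
import math

def kernel_estimate(y, sample, h):
    """Gaussian-kernel estimate f_n(y) in one dimension with bandwidth h."""
    norm = len(sample) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((y - x) / h) ** 2) for x in sample) / norm

def interval_cover(x0, rho, eps):
    """Points spaced 2*eps apart: an eps-cover of [x0 - rho, x0 + rho]."""
    count = int(math.ceil(rho / eps)) + 1
    return [x0 - rho + 2.0 * eps * i for i in range(count)]

def estimate_modes(sample, x0, rho, eps, h, k):
    """Evaluate f_n only at the cover points and return the k cover points
    with the largest locally maximal estimates, in increasing order."""
    cover = interval_cover(x0, rho, eps)
    f = [kernel_estimate(y, sample, h) for y in cover]
    peaks = [j for j in range(1, len(cover) - 1) if f[j - 1] < f[j] > f[j + 1]]
    peaks.sort(key=lambda j: f[j], reverse=True)
    return sorted(cover[j] for j in peaks[:k])
```

Only `len(cover)` evaluations of the estimate are required, which mirrors the
replacement of a covering-number argument by the finite cardinality of the
cover.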

It only remains to show the sample complexities for this approach. The analysis
is identical to the one in Section 5.2, except that now the class
$\mathcal{K}_\sigma$ is defined as

$$\mathcal{K}_\sigma = \{K_{\sigma,y}(x) : y \in S\}$$

instead of the class used in Chapter 5, where $f_n(y)$ estimated $f(y)$ uniformly
for $y \in \mathbb{R}^N$. We can use the cardinality $|\mathcal{K}_\sigma|$
instead of a covering number (and hence do not require
$VC(\mathcal{K}_\sigma)$). From above we have
$|\mathcal{K}_\sigma| = s_\epsilon$. Following (5.12) and using the cardinality
w.r.t. the appropriate accuracy yields a sample complexity

^  N  log  ( 5+  log  N)  p 

n  =  c14 - — r~ - l°g"T 

£2\o&N  €0 

which differs from the $n$ of Theorem 5.1 in the 4 instead of the 13. So this
other approach contributes an exponential reduction in the number of unlabeled
examples relative to our algorithm $K$ approach; however, the resulting
unlabeled sample size is still exponential in $N$ and $1/\epsilon$, and an
additional constraint is that the learner must know a priori the region $B$
which contains the true modes of $f$.

In summary, in this thesis we considered an interesting and simple question
raised by T. M. Cover [5], which asks what the tradeoff is between labeled and
unlabeled examples for learning a classification rule. We used theory from
statistics and mathematical analysis to formalize this question in terms of a
probabilistic setting in which sample sizes influence not only the
misclassification accuracy but also the confidence of the classification
decision. It became clear, as we began to get expressions for the value of a
labeled example, that side information is a crucial assumption in learnability.
That is, learning by examples is very dependent on what is assumed to be known
by the learner a priori.

It seems that there should be a way to represent both examples and side
information in the same model of learning, so that, for instance, one would be
able to tell how many more examples are needed if less side information is at
hand, or vice versa. This is an interesting theoretical question. If an elegant
and intuitive model could be contrived to represent it, then its consequences
would have strong theoretical, if not also practical, implications for our
understanding of information in conjunction with learning by examples.

There has been some related work on this question. Abu-Mostafa [45] represents
it in terms of giving the learner hints and uses VC-dimension arguments to
formalize it. Haussler, Kearns & Schapire [44] consider a Bayesian
information-theoretic model in which the unknown function is drawn according to
some prior density from a function class, and different degrees of side
information are represented by using different types of priors.

From our work we saw that in both the parametric and nonparametric scenarios
there was a primary function class defined, which was indirectly related to the
class of mixtures that contained the unknown underlying mixture. This primary
function class in essence represents the "engine" of the so-called uniform SLLN,
the mathematical machinery that produced for us all the sample complexity
results. For instance, going from the nonparametric kernel technique to the
parametric MLE takes us from the function class

$$\mathcal{K}_\sigma = \{K_{\sigma,x} : x \in E \subset \mathbb{R}^N\}$$

to

$$\mathcal{G} = \{g(x|\theta) = \log f(x|\theta)\,1_D(x) : \theta \in \Theta \subset \mathbb{R}^{2N}\}.$$

Both engine classes have a complexity measure, namely the covering number or the
VC-dimension, as we showed in preceding chapters. The complexity of the class
$\mathcal{K}_\sigma$ is exponential in $N$ while the complexity of $\mathcal{G}$
is linear in $N$. So it seems intuitive that whatever side information is
available to the learner in the parametric scenario, and which is absent in the
nonparametric kernel scenario, could possibly be represented by some function of
the difference between the complexities of these two function classes. It is as
though the teacher points his finger at a lower dimensional area in some high
dimensional space, and thereby communicates this side information to the
learner, i.e., restricts the learner to search for the unknown function over a
less complex set of functions which contains the unknown desired function. This
results in requiring fewer examples, as we saw, for instance, in the significant
reduction of the unlabeled sample complexity when going from the kernel to the
MLE scenario. In ongoing research, we are investigating possible models that
represent these ideas in a formal framework.

We have used the majority rule algorithm with the labeled examples for all the
mixed-sample learning algorithms when choosing the labeling of the partition of
the classifier. It is worth noting that the same rule can be used in the
operation of the classifier, i.e., when one needs to test a hypothesis of having
class "1" or "2". One can take $r$ test examples and let the classifier assign a
label to each example, then choose the hypothesis which corresponds to the label
of the majority of the $r$ examples. As in classical hypothesis testing, the
larger $r$ is, the better the performance, i.e., the lower the probability of
making a bad hypothesis decision. In fact, the probability of error decreases to
0 exponentially fast with $r$.
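
The exponential decrease with $r$ can be checked numerically with a short
sketch. It rests on assumptions not spelled out in the text: the $r$ test
examples all come from the same class, the classifier's decisions on them are
independent, each is wrong with the same probability `q < 1/2`, and `r` is odd
so that ties cannot occur. The function names are hypothetical; the bound is the
standard Hoeffding inequality.

```python
import math

def majority_error(q, r):
    """Exact probability that the majority of r independent decisions is
    wrong when each decision is wrong with probability q (r odd)."""
    return sum(math.comb(r, j) * q ** j * (1 - q) ** (r - j)
               for j in range((r + 1) // 2, r + 1))

def hoeffding_bound(q, r):
    """Hoeffding upper bound exp(-2 r (1/2 - q)^2), valid for q < 1/2."""
    return math.exp(-2.0 * r * (0.5 - q) ** 2)

if __name__ == "__main__":
    for r in (1, 3, 11, 51):
        print(r, majority_error(0.2, r), hoeffding_bound(0.2, r))
```

For a single-example error of 0.2, three votes already reduce the error to
0.104, and the exact error stays below the exponentially decaying bound for
every odd `r`.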

Extensions to the case of more than two pattern classes and to other parametric
forms can be tackled using the same approach as we have taken here. In the
parametric approach one would need to have identifiable mixtures, and the MLE
analysis will be very similar to the Gaussian case. The function class over
which the uniform SLLN is to be applied will possibly have a different
complexity, but in most cases we expect a polynomial growth for the number of
unlabeled examples w.r.t. the important variables and the dimensionality $N$. In
the nonparametric scenario, it would be interesting to determine other classes
of mixtures where the modes of a mixture $f$ can determine the Bayes partition,
in particular for the cases of more than two pattern classes and for nonlinear
decision borders.

An interesting extension for learning classification is using examples that can
take on a label of a fuzzy nature. For instance, one type of such examples is
denoted by $(x_i, y_i)$, $1 \le i \le l$, where $x_i \in \mathbb{R}^N$ is a
feature vector and $0 \le y_i \le 1$ represents the probability that $x_i$ is of
class 1. If $y_i = 1$ then it corresponds to having a labeled example from class
1, while if $y_i = 0$ then it is a labeled example from class 2. When
$y_i = 1/2$ the example is considered as unlabeled. Examples of this kind
therefore allow for a full spectrum of confidence in the label (i.e., from 0 to
1) and may be useful in situations where the teacher can only provide
likelihoods or confidences about the true origin of a particular feature vector
$x$. Referring to this type of examples as the fuzzy kind, it is interesting to
ask what the sample complexity, $l$, is for learning a decision rule using
examples of this kind. One approach is to use the $l$-sample to estimate the a
posteriori functions $p(1|x)$ and $p(2|x)$, which directly identify the Bayes
rule as in (1.1). It suffices to estimate $p(1|x)$, as $p(2|x) = 1 - p(1|x)$.
Let us assume that the mixture density $f(x)$ and the class conditional
densities are parametric. In this case the function $p(1|x)$ is a member of a
parametric class of functions, indexed by a finite real vector $\phi$; we denote
it by $p_\phi(1|x)$, where $\phi = [p_1, p_2, \theta_1, \theta_2]$ also indexes
the mixture $f(x|\phi)$ in the class of mixtures. We can let the teacher draw
$(x_i, y_i)$ according to some arbitrary probability density $g(x,y)$ and define
a discrepancy measure as $E\,(p_\phi(1|x) - y)^2$, where $y$ is a function of
$x$ and represents the true unknown $p_{\phi_0}(1|x)$, and
$\phi, \phi_0 \in A \subset \mathbb{R}^k$, where $k$ does not necessarily equal
the dimension $N$ of the feature vectors $x$ (the expectation is w.r.t.
$g(x,y)$). We can apply the uniform SLLN over the class of functions
$\mathcal{H} = \{h_\phi(x,y) = (p_\phi(1|x) - y)^2 : \phi \in A \subset \mathbb{R}^k\}$
to get

$$\sup_{\phi \in A} \left| E\,h_\phi(x,y) - \frac{1}{l}\sum_{i=1}^{l} h_\phi(x_i, y_i) \right| < \epsilon \quad \text{with confidence} \ge 1 - \delta.$$

The learner then finds a function $h_{\hat\phi}$ which minimizes the empirical
mean $\frac{1}{l}\sum_{i=1}^{l} h_\phi(x_i, y_i)$ over all $\phi \in A$, and for
sufficiently small $\epsilon > 0$ it follows that $p_{\hat\phi}(1|x)$ is a close
estimate of the true unknown a posteriori function $p_{\phi_0}(1|x)$ in the MSE
sense w.r.t. any probability density $g(x,y)$ (due to the distribution
independent results of the uniform SLLN theorems). In order for this to hold, a
sufficient sample complexity can be calculated by determining a bound on the
covering number of $\mathcal{H}$. Roughly speaking, since the parameter set $A$
is in $\mathbb{R}^k$, which is also the parameter space of the mixture density
$f(x|\phi)$, the size $l$ of the fuzzy sample will differ by a constant factor
(w.r.t. the dimensionality $k$ of the parameter space) from the unlabeled sample
size $n$ that is sufficient to estimate the parametric mixture $f(x|\phi)$. In
all the mixed-sample cases investigated earlier we saw that the number of
labeled examples $m$ is only $c \log(1/\delta)$, which is significantly smaller
than $n$; therefore $l$ is of the same order as the total $m + n$.
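
The fuzzy-sample estimate described above, i.e., minimizing the empirical mean
of $(p_\phi(1|x) - y)^2$ over the parameter, can be sketched for a simple
special case. The sketch assumes one-dimensional features and unit-variance
Gaussian class densities, for which the posterior is a logistic function of $x$;
the reparametrization by `(w, b)`, the use of plain gradient descent, the step
size, and all function names are assumptions made for illustration only.

```python
import math

def posterior(x, w, b):
    """p_phi(1|x) for two unit-variance Gaussian classes in one dimension:
    logistic in x with w = theta1 - theta2 and
    b = log(p1 / p2) - (theta1**2 - theta2**2) / 2."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def empirical_mse(sample, w, b):
    """Empirical mean of h_phi(x, y) = (p_phi(1|x) - y)^2 over the sample."""
    return sum((posterior(x, w, b) - y) ** 2 for x, y in sample) / len(sample)

def fit_fuzzy(sample, steps=3000, lr=1.0):
    """Gradient descent on the empirical mean squared discrepancy."""
    w = b = 0.0
    n = len(sample)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in sample:
            p = posterior(x, w, b)
            g = 2.0 * (p - y) * p * (1.0 - p)  # derivative w.r.t. w*x + b
            gw += g * x
            gb += g
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b
```

Each example here is a pair `(x, y)` with `y` in `[0, 1]`, so fully labeled
examples (`y` equal to 0 or 1) and uninformative ones (`y = 1/2`) are handled by
the same fitting procedure.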

Hence when learning a classification problem one can either use unlabeled
examples to estimate the class conditional densities and subsequently the
decision regions, and then label them using the labeled sample, or use a fuzzy
sample to estimate the a posteriori functions $p(1|x)$ and $p(2|x)$, which
directly result in an estimate of the Bayes decision rule. In both cases the
total number of examples is roughly the same. There are many other types of
samples that one may investigate. For instance, one may introduce a noise
component to the labels by making the label a random variable $z$ which takes on
the value $y$ (the true label) with probability $1 - a$ for small $a > 0$ and
takes on the complement label $y^c$ with probability $a$. Examples of this type
are useful for representing the possibility of miscommunication between the
teacher and the learner. Clearly, an investigation of learning with such
examples and other variants of this type is interesting and will require more
work.

In the mixed-sample approaches we considered both the unlabeled and labeled
examples as being drawn according to the underlying true unknown densities
$f_1(x)$ and $f_2(x)$. This represents a natural setting in which there is no
teacher who controls the learning process; instead a passive "nature" presents
the examples. It is a suitable representation when learning is primarily done
using unlabeled examples, as was true in the cases we investigated. As we
mentioned above, in this setting side information is related to the complexity
of the engine class of functions over which the uniform SLLN is applied.

However, when dealing with a purely labeled sample or a fuzzy sample as above,
there is an obvious role for a teacher to represent side information: an active
teacher is free to choose a particular probability density $g(x,y)$ for randomly
drawing the examples. In the previous discussion regarding the fuzzy sample the
sample size $l$ was distribution independent, hence the freedom to choose
$g(x,y)$ was not exploited. When the teacher uses a particular distribution to
draw the examples he restricts the learner to search for the unknown function in
an effectively simpler class of functions, i.e., one whose complexity is the
covering number w.r.t. the probability distribution $g(x,y)$, which may be
smaller than the upper bound based on the VC-dimension of (3.8). This complexity
measure has a direct bearing on the sample complexity (see (3.9)), hence it may
be possible to reduce the sufficient sample sizes (both for the purely labeled
and the fuzzy sample) by selecting particular probability densities $g(x,y)$,
which effectively conveys side information. When the sample space is discrete
and the functions are indicators of sets (cf. Benedek & Itai [49]) it is clearly
seen that the sample size $l$ is directly related to the distribution $g(x,y)$.
Using these ideas it is possible to choose distributions that place mass only on
"interesting" or relevant sets of the class and therefore in effect reduce the
complexity of the class, resulting in a reduction of the sample size. For more
related work see Benedek & Itai [50] and Bartlett & Williamson [51].

The subject of animal learning is related to learning with labeled and unlabeled
examples. In real life, an animal gets penalized when making a wrong choice. The
penalty can be viewed as the negative label. In this respect, it is reasonable
to expect that an animal tends to minimize the number of labeled examples that
it needs for learning basic primitive tasks, as the cost of negatively labeled
examples is high (for instance, a negative label may mean the animal loses a
limb or perhaps its life). Young animals in the wild need to learn very quickly
(relative to humans) in order to reach the stage in which they no longer rely on
their parents. Considering that labeled examples are rare or costly (especially
for animals who do not have a language of exact communication), it is rational
to suppose that animals rely on learning with unlabeled examples, which are
abundant in their natural habitat, and less on labeled examples. (Of course the
genetic factor is also very important here, since it may result in many fewer
things that need to be learned.) The biological nervous system, in particular
the brain of the animal, perhaps uses mechanisms or neural architectures which
put more weight on learning with unlabeled examples and minimize the need for
labeled examples.

The self-organizing neural networks that we considered in Chapter 5 are similar
to some biological neural networks (cf. Kohonen [24]). It is known that
topological maps similar to those which arise in self-organizing Kohonen neural
networks (which use primarily unlabeled examples) are common in the real brain.
In our simulations we saw that specific neural architectures need fewer labeled
examples; it is therefore conceivable that biological neural networks possess
architectures that need fewer labeled examples. This may be achieved by
specialization, i.e., networks that represent decision rules which are based on
partitions having separating surfaces that are fit to a particular class of
pattern classification problems.